How to use ModernBERT as a sentence transformer?

#9
by hungrybiker - opened

How do I do the equivalent of this with ModernBERT? I know it's not a sentence transformer, and AutoModel / BertModel don't recognize it correctly.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences_1 = 'How is the weather today?'
sentences_2 = 'What is the current weather like today?'

model = SentenceTransformer('sentence-transformers/msmarco-MiniLM-L-12-v3', trust_remote_code=True)
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)

print(cos_sim(embeddings_1, embeddings_2))

The base model has not been fine-tuned for retrieval tasks, so it does not work as a retriever off the shelf.

I have a sentence-transformer fine-tuned version based on the official fine-tuning script, trained with a larger batch size. It performs better than the numbers reported in the paper. Please give it a try and share your experience!

https://huggingface.co./joe32140/ModernBERT-base-msmarco
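For reference, a minimal usage sketch (the sentence pair is only an illustration):

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the fine-tuned retrieval checkpoint.
model = SentenceTransformer("joe32140/ModernBERT-base-msmarco")

# Encode a toy query/passage pair and compare them with cosine similarity.
query_emb = model.encode("How is the weather today?", normalize_embeddings=True)
passage_emb = model.encode("What is the current weather like today?", normalize_embeddings=True)
print(cos_sim(query_emb, passage_emb))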

@hungrybiker https://huggingface.co./blog/train-sentence-transformers may be able to help :) Do note that you need to install transformers from main using

pip install --upgrade git+https://github.com/huggingface/transformers.git

The base model has not been fine-tuned for retrieval tasks, so it does not work as a retriever off the shelf.

I have a sentence-transformer fine-tuned version based on the official fine-tuning script, trained with a larger batch size. It performs better than the numbers reported in the paper. Please give it a try and share your experience!

How did you do this? The train_st.py script loads model = SentenceTransformer("answerdotai/ModernBERT-base"), but that doesn't exist as a sentence-transformers model, so what do you fine-tune based on?

Hi @WoutDeRijck ,

As mentioned in @Xenova 's comment, you need to install the latest dev version of the transformers package with pip install --upgrade git+https://github.com/huggingface/transformers.git.

@joe32140 , no, it's not a problem with ModernBERT in transformers; I get the problem with Sentence Transformers: "No sentence-transformers model found with name answerdotai/ModernBERT-base ...."

@WoutDeRijck

I see. In that case, the problem is the same as I suggested: ModernBERT-base is not trained as a sentence transformer.

The message you got is not an error but a warning, to make sure you are aware that this model does not work as a sentence transformer out of the box. You can go ahead and fine-tune your own sentence transformer based on answerdotai/ModernBERT-base.
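If it helps, here is a minimal fine-tuning sketch in the spirit of the blog post linked above (the dataset, output name, and hyperparameters are placeholders, not the settings of the official script):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# The "Creating a new one with mean pooling" warning is expected here.
model = SentenceTransformer("answerdotai/ModernBERT-base")

# Placeholder (anchor, positive) pairs; swap in your own data.
train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")

loss = MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
model.save("modernbert-base-st-finetuned")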

@joe32140 , no, it's not a problem with ModernBERT in transformers; I get the problem with Sentence Transformers: "No sentence-transformers model found with name answerdotai/ModernBERT-base ...."

Hi, I'm having the same problem. @joe32140 I don't understand how to go ahead.

@guid02

I got the same message, but it won't stop your fine-tuning script. As far as I understand, it just tells you that the checkpoint was not built with Sentence Transformers in the first place. Is your script actually stopping at this line?

In [3]: model = SentenceTransformer("answerdotai/ModernBERT-base")
No sentence-transformers model found with name answerdotai/ModernBERT-base. Creating a new one with mean pooling.
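That warning just means Sentence Transformers wraps the raw encoder with a default mean-pooling head. Written out explicitly, the automatic construction amounts to roughly this sketch:

from sentence_transformers import SentenceTransformer, models

# Load ModernBERT as a plain transformer module ...
word_embedding_model = models.Transformer("answerdotai/ModernBERT-base")

# ... and add the mean-pooling layer the warning refers to.
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])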

@joe32140 And how do you fine-tune this? Following examples/train_st.py doesn't work off the shelf... Thanks for the quick response!

This is the error message

Traceback (most recent call last):
  File "/root/Documents/BERT/ModernBERT/examples/train_st.py", line 94, in <module>
    main()
  File "/root/Documents/BERT/ModernBERT/examples/train_st.py", line 71, in main
    dev_evaluator(model)
  File "/root/Documents/BERT/venv/lib/python3.10/site-packages/sentence_transformers/evaluation/TripletEvaluator.py", line 194, in __call__
    positive_scores, negative_scores = similarity_functions[fn_name](
  File "/root/Documents/BERT/venv/lib/python3.10/site-packages/sentence_transformers/evaluation/TripletEvaluator.py", line 174, in <lambda>
    paired_cosine_distances(anchors, positives),
  File "/root/Documents/BERT/venv/lib/python3.10/site-packages/sklearn/utils/_param_validation.py", line 216, in wrapper
    return func(*args, **kwargs)
  File "/root/Documents/BERT/venv/lib/python3.10/site-packages/sklearn/metrics/pairwise.py", line 1307, in paired_cosine_distances
    X, Y = check_paired_arrays(X, Y)
  File "/root/Documents/BERT/venv/lib/python3.10/site-packages/sklearn/metrics/pairwise.py", line 263, in check_paired_arrays
    X, Y = check_pairwise_arrays(X, Y)
  File "/root/Documents/BERT/venv/lib/python3.10/site-packages/sklearn/metrics/pairwise.py", line 209, in check_pairwise_arrays
    Y = check_array(
  File "/root/Documents/BERT/venv/lib/python3.10/site-packages/sklearn/utils/validation.py", line 1107, in check_array
    _assert_all_finite(
  File "/root/Documents/BERT/venv/lib/python3.10/site-packages/sklearn/utils/validation.py", line 120, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "/root/Documents/BERT/venv/lib/python3.10/site-packages/sklearn/utils/validation.py", line 169, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input contains NaN.

@joe32140
Yes, same thing, but then I get very bad results on the default baseline:
No sentence-transformers model found with name answerdotai/ModernBERT-large. Creating a new one with mean pooling.
dim_1024_cosine_ndcg@10: 0.034502711430959544
dim_768_cosine_ndcg@10: 0.03088882109036852
dim_512_cosine_ndcg@10: 0.020989891801909726
dim_256_cosine_ndcg@10: 0.03654070604182175
dim_128_cosine_ndcg@10: 0.04005149885939249
dim_64_cosine_ndcg@10: 0.03745167403718091

I'm trying to evaluate the model on a default baseline first, to see whether I can improve these scores by fine-tuning on my specific dataset.
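For what it's worth, a minimal sketch of such a baseline retrieval evaluation with Sentence Transformers (the queries, corpus, and relevance judgments below are made up; a real run would use your own dataset):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("answerdotai/ModernBERT-large")

# Tiny made-up retrieval task: query id -> text, doc id -> text, and relevance.
queries = {"q1": "How is the weather today?"}
corpus = {
    "d1": "Today's forecast calls for light rain and mild temperatures.",
    "d2": "The stock market fell sharply after the announcement.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="baseline")
print(evaluator(model))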

The base model has not been fine-tuned for retrieval tasks, so it does not work as a retriever off the shelf.

I have a sentence-transformer fine-tuned version based on the official fine-tuning script, trained with a larger batch size. It performs better than the numbers reported in the paper. Please give it a try and share your experience!

https://huggingface.co./joe32140/ModernBERT-base-msmarco

@guid02

As I mentioned above, the baseline has not been trained for retrieval tasks. Can you try my fine-tuned checkpoint (joe32140/ModernBERT-base-msmarco) to see if it's better?

@joe32140
I'll definitely try it; however, I'm using ModernBERT-large.

@joe32140 And how do you fine-tune this? Following examples/train_st.py doesn't work off the shelf... Thanks for the quick response!


@WoutDeRijck

I can finetune the model without this error. FWIW, I am using Python 3.11:

datasets                    2.21.0
scikit-learn                1.6.0
sentence-transformers       3.3.0
transformers                4.48.0.dev0
pytorch-ranger              0.1.1
torch                       2.4.0
torch-optimi                0.2.1
torch-optimizer             0.3.0
torchaudio                  2.4.0
torchmetrics                1.4.0.post0
torchvision                 0.19.0

@joe32140
I'll definitely try it; however, I'm using ModernBERT-large.

@guid02 I plan to fine-tune a large version. Will keep you posted!

@joe32140 , I made an environment like yours, but I still get the same error. I am running this locally on an NVIDIA RTX 3060 GPU.

I resolved the previous error by installing the correct version of Flash Attention with pip install "flash_attn==2.6.3" --no-build-isolation. However, I am now running into a new error.

(venv) root@OmenvanWout:~/Documents/BERT/ModernBERT# python3 examples/train_st.py 
No sentence-transformers model found with name answerdotai/ModernBERT-base. Creating a new one with mean pooling.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in ModernBertModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 40.85it/s]
Loading dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:02<00:00,  5.79it/s]
/root/Documents/BERT/venv/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:150: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
  0%|                                                                                                                   | 0/2442 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/root/Documents/BERT/ModernBERT/examples/train_st.py", line 94, in <module>
    main()
  File "/root/Documents/BERT/ModernBERT/examples/train_st.py", line 82, in main
    trainer.train()
  File "/root/Documents/BERT/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2163, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/Documents/BERT/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2523, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Documents/BERT/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3703, in training_step
    loss /= self.args.gradient_accumulation_steps
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

I guess there is still a version mismatch somewhere
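Regarding the Flash Attention dtype warnings above: as far as I know, you can pass the dtype and attention implementation through to the underlying Hugging Face model when building the SentenceTransformer, e.g. this sketch (not something from the official script):

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "answerdotai/ModernBERT-base",
    model_kwargs={
        "torch_dtype": torch.bfloat16,               # FA2 requires fp16/bf16
        "attn_implementation": "flash_attention_2",  # forwarded to from_pretrained
    },
    device="cuda",
)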


@WoutDeRijck Yeah, that seems to be a PyTorch version problem. I would suggest rebuilding the environment from scratch.

I get the same problems running this on a Google Colab T4 or an A100. Could you describe how you installed your environment, @joe32140?

Answer.AI org

Are you using gradient accumulation steps in your script? Perhaps you can bypass the issue by removing them.
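If the script uses SentenceTransformerTrainingArguments, that would amount to something like this sketch (the output path and batch size are just illustrative):

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output/modernbert-st",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=1,  # effectively disables gradient accumulation
)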

@tomaarsen , I am just using the examples/train_st.py script from the ModernBERT GitHub repo, but I'll try adjusting that in the args.

@guid02 Here you go! https://huggingface.co./joe32140/ModernBERT-large-msmarco . I have some promising evaluation results for the large model and will add them to the model card later today.

@WoutDeRijck @tomaarsen
I think I found the reason. Recent patches to transformers seem to break the codebase; you need to reinstall from an earlier commit:

pip uninstall transformers
pip install git+https://github.com/huggingface/transformers.git@f42084e6411c39b74309af4a7d6ed640c01a4c9e

@tomaarsen @NohTow @bwarner I might not have time to find the root cause, but it seems to be a major issue in the latest version.

Answer.AI org

If anyone has a moment for it, I would really appreciate it if they'd open an issue on transformers. Then I can have a look at it as soon as I'm available again.

  • Tom Aarsen

Seems to be working up to that commit. Just a question:

Loading dataset shards: 100% 17/17 [00:00<00:00, 65.01it/s]
  0% 3/2442 [00:43<9:43:19, 14.35s/it]

Is it normal that this takes so long? Why? (This is all running on Google Colab.)

@tomaarsen , I opened an issue describing the problem.

https://github.com/huggingface/transformers/issues/35407

@guid02 Here you go! https://huggingface.co./joe32140/ModernBERT-large-msmarco . I have some promising evaluation results for the large model and will add them to the model card later today.

@joe32140 thank you! Gonna try today!

@joe32140
I get an error with the compiler while using your models:
File "C:\Users\User.DT-SALASERVER\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\bin\Hostx64\x64\cl.exe', 'C:\Users\USER1.DT-\AppData\Local\Temp\tmpss6hj4zq\main.c', '-O3', '-shared', '-lcuda', '-LC:\Users\User.DT-SALASERVER\Desktop\AI CSV\AI Proj 1\venv1\Lib\site-packages\triton\backends\nvidia\include', '-LC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\include', '-LC:\Users\USER1.DT-\AppData\Local\Temp\tmpss6hj4zq', '-LC:\Users\User.DT-SALASERVER\AppData\Local\Programs\Python\Python311\Include', '-IC:\Users\User.DT-SALASERVER\Desktop\AI CSV\AI Proj 1\venv1\Lib\site-packages\triton\backends\nvidia\lib', '-IC:\Users\User.DT-SALASERVER\AppData\Local\Programs\Python\Python311\libs', '-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\lib\x64', '-o', 'C:\Users\USER~1.DT-\AppData\Local\Temp\tmpss6hj4zq\cuda_utils.cp311-win_amd64.pyd']' returned non-zero exit status 2.

Seems to be working up to that commit. Just a question:

Loading dataset shards: 100% 17/17 [00:00<00:00, 65.01it/s]
  0% 3/2442 [00:43<9:43:19, 14.35s/it]

Is it normal that this takes so long? Why? (This is all running on Google Colab.)

@WoutDeRijck You can try a smaller mini_batch_size. I feel like when GPU VRAM isn't sufficient, the library falls back to some kind of offloading, which significantly affects training time. Fine-tuning the base model on 1.25M training instances takes me only about an hour on an RTX 4090.
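Concretely, assuming the script uses CachedMultipleNegativesRankingLoss (which is what exposes mini_batch_size), the mini batch only controls how the large batch is chunked through the GPU, so lowering it reduces VRAM use without changing the effective batch size. A sketch:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("answerdotai/ModernBERT-base")

# Keep the large effective batch for in-batch negatives,
# but process it through the encoder in smaller chunks.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)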

@joe32140
I get an error with the compiler while using your models:
File "C:\Users\User.DT-SALASERVER\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\bin\Hostx64\x64\cl.exe', 'C:\Users\USER1.DT-\AppData\Local\Temp\tmpss6hj4zq\main.c', '-O3', '-shared', '-lcuda', '-LC:\Users\User.DT-SALASERVER\Desktop\AI CSV\AI Proj 1\venv1\Lib\site-packages\triton\backends\nvidia\include', '-LC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\include', '-LC:\Users\USER1.DT-\AppData\Local\Temp\tmpss6hj4zq', '-LC:\Users\User.DT-SALASERVER\AppData\Local\Programs\Python\Python311\Include', '-IC:\Users\User.DT-SALASERVER\Desktop\AI CSV\AI Proj 1\venv1\Lib\site-packages\triton\backends\nvidia\lib', '-IC:\Users\User.DT-SALASERVER\AppData\Local\Programs\Python\Python311\libs', '-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\lib\x64', '-o', 'C:\Users\USER~1.DT-\AppData\Local\Temp\tmpss6hj4zq\cuda_utils.cp311-win_amd64.pyd']' returned non-zero exit status 2.

@guid02 The problem does not seem to be related to transformers. Could you run it in WSL instead? I have no experience running this on Windows.
