Tom Aarsen
tomaarsen's activity
⛏ Hard negatives are texts that are rather similar to some anchor text (e.g. a query), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
mine_hard_negatives docs: https://sbert.net/docs/package_reference/util.html#sentence_transformers.util.mine_hard_negatives
🔓 Beyond that, this release removes the numpy<2 restriction from v3.1.0. This was previously required for Windows as not all third-party libraries were updated to support numpy v2. With Sentence Transformers, you can now choose v1 or v2 of numpy.
Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.1
I'm looking forward to releasing v3.2, I have some exciting things planned 🚀
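To make the hard negatives idea from this post concrete, here is a minimal pure-Python sketch. It assumes toy 3-dimensional embeddings and a hypothetical helper called pick_hard_negatives; it is not the mine_hard_negatives utility itself, which works on datasets and is documented at the link above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_hard_negatives(anchor, positive_idx, corpus, num_negatives=2):
    """Return indices of the corpus entries most similar to the anchor,
    skipping the known positive. Hypothetical helper, not the library API."""
    scored = [
        (cosine(anchor, emb), idx)
        for idx, emb in enumerate(corpus)
        if idx != positive_idx
    ]
    scored.sort(reverse=True)  # most similar first
    return [idx for _, idx in scored[:num_negatives]]


# Toy "embeddings"; real ones would come from an embedding model.
corpus = [
    [0.9, 0.1, 0.0],  # 0: the correct match for the query
    [0.8, 0.3, 0.1],  # 1: very similar, but not the correct match
    [0.0, 0.1, 0.9],  # 2: unrelated text
    [0.7, 0.0, 0.4],  # 3: somewhat similar
]
query = [1.0, 0.1, 0.0]

hard_negatives = pick_hard_negatives(query, positive_idx=0, corpus=corpus)
print(hard_negatives)  # indices of the closest non-matching texts
```

The unrelated text (index 2) is an easy negative and is left out. The near misses are what give the model a useful training signal.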
Training a SetFit classifier model consists of 2 phases:
1. Finetuning a Sentence Transformer embedding model
2. Training a Classifier to map embeddings -> classes
🔌 The first phase now uses the SentenceTransformerTrainer that was introduced in the Sentence Transformers v3 update. This brings some immediate upsides like MultiGPU support, without any (intended) breaking changes.
➡️ Beyond that, we softly deprecated the "evaluation_strategy" argument in favor of "eval_strategy" (following a Transformers deprecation), and deprecated Python 3.7. In return, we add official support for Python 3.11 and 3.12.
✨ There's some more minor changes too, like max_steps and eval_max_steps now being a hard limit instead of an approximate one, training/validation losses now logging nicely in Notebooks, and the "device" parameter no longer being ignored in some situations.
Check out the full release notes here: https://github.com/huggingface/setfit/releases/tag/v1.1.0
Or read the documentation: https://huggingface.co./docs/setfit
Or check out the public SetFit models for inspiration: https://huggingface.co./models?library=setfit&sort=created
P.s. the model in the code snippet trained in 1 minute and it can classify ~6000 sentences per second on my GPU.
Glad to hear it! Feel free to send over feedback if you have any, it's always quite valuable for new features/docs.
⛏ Hard Negatives Mining Utility: Hard negatives are texts that are rather similar to some anchor text (e.g. a question), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
📉 New loss function: This loss function works very well for symmetric tasks (e.g. clustering, classification, finding similar texts/paraphrases) and a bit less so for asymmetric tasks (e.g. question-answer retrieval).
💾 Streaming datasets: You can now train with the datasets.IterableDataset, which doesn't require downloading the full dataset to disk before training. As simple as "streaming=True" in your "datasets.load_dataset".
🧩 Custom Modules: Model authors can now customize a lot more of the components that make up Sentence Transformer models, allowing for a lot more flexibility (e.g. multi-modal, model-specific quirks, etc.)
✨ New arguments to several methods: encode_multi_process gets a progress bar, push_to_hub can now be done to different branches, and CrossEncoders can be downloaded to specific cache directories.
🐛 Bug fixes: Too many to name here, check out the release notes!
📝 Documentation: A particular focus on clarifying the batch samplers in the Package Reference this release.
Check out the full release notes here ⭐: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.0
I'm very excited to hear your feedback, and I'm looking forward to the future changes that I have planned, such as ONNX inference! I'm also open to suggestions for new features: feel free to send me your ideas.
📚 Trained on a large dataset of 558k Arabic triplets translated from the AllNLI triplet dataset: Omartificial-Intelligence-Space/Arabic-NLi-Triplet
6️⃣ 6 different base models: AraBERT, MarBERT, LaBSE, MiniLM, paraphrase-multilingual-mpnet-base, mpnet-base, ranging from 109M to 471M parameters.
🪆 Trained with a Matryoshka loss, allowing you to truncate embeddings with minimal performance loss: smaller embeddings are faster to compare.
📈 Outperforms all commonly used multilingual models like intfloat/multilingual-e5-large, sentence-transformers/paraphrase-multilingual-mpnet-base-v2, and sentence-transformers/LaBSE.
Check them out here:
- Omartificial-Intelligence-Space/Arabic-mpnet-base-all-nli-triplet
- Omartificial-Intelligence-Space/Arabic-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabic-labse-Matryoshka
- Omartificial-Intelligence-Space/Marbert-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet
Or the collection with all: Omartificial-Intelligence-Space/arabic-matryoshka-embedding-models-666f764d3b570f44d7f77d4e
My personal favourite is likely Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka: a very efficient 135M parameters & scores #1 on mteb/leaderboard.
1️⃣ Gradient checkpointing allows for much less memory usage at a cost of ~20% training speed. Seems to allow for higher batch sizes, which is quite important for loss functions with in-batch negatives.
2️⃣ You can specify args.push_to_hub=True and args.hub_model_id to upload your model checkpoints to Hugging Face while training. It also uploads your emissions (if codecarbon is installed) and your Tensorboard logs (if tensorboard is installed).
3️⃣ Model card improvements: improved automatic widget examples, better tags, and the default of "sentence_transformers_model_id" now gets replaced when possible.
4️⃣ Several evaluator fixes, see release notes for details.
5️⃣ Fixed a bug with MatryoshkaLoss throwing an error if the supplied Matryoshka dimensions are ascending instead of descending.
6️⃣ Full Safetensors support; even the uncommon modules can now save and load "model.safetensors" files: no more pickle risks.
Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.0.1
And let me know what kind of features you'd like to see next! I have some plans already (ONNX, Sparse models, ColBERT, PEFT), but I don't yet know how I should prioritize everything.
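On the point in 1️⃣ about batch size and in-batch negatives: the sketch below is a simplified pure-Python version of a ranking loss with in-batch negatives, not the library's implementation. The function names, the toy similarity values, and the scale of 20 are all assumptions for illustration. Each anchor treats the other positives in the batch as negatives, so a batch of size n gives n - 1 negatives per anchor.

```python
import math


def in_batch_negatives_loss(sim, scale=20.0):
    """Mean cross-entropy over rows of a similarity matrix.

    sim[i][j] is the similarity between anchor i and positive j; the
    diagonal holds the true pairs, everything else acts as a negative.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(sim)


def toy_similarities(batch_size, positive=0.9, negative=0.5):
    """Hypothetical similarity matrix: true pairs score higher than the rest."""
    return [
        [positive if i == j else negative for j in range(batch_size)]
        for i in range(batch_size)
    ]


for batch_size in (2, 8, 32):
    loss = in_batch_negatives_loss(toy_similarities(batch_size))
    print(batch_size, round(loss, 6))
```

With the same per-pair similarities, the loss grows with batch size because every anchor has to beat more negatives. This is one way to see why the extra batch size that gradient checkpointing allows can matter for these losses.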
I just tried this out, and wow, it works very well!
1️⃣ Training Refactor
Embedding models can now be trained using an extensive trainer with a lot of powerful features:
- MultiGPU Training (Data Parallelism (DP) and Distributed Data Parallelism (DDP))
- bf16 training support; loss logging
- Evaluation datasets + evaluation loss
- Improved callback support + an excellent Weights & Biases integration
- Gradient checkpointing, gradient accumulation
- Improved model card generation
- Resuming from a training checkpoint without performance loss
- Hyperparameter Optimization
and much more!
Read my detailed blogpost to learn about the components that make up this new training approach: https://huggingface.co./blog/train-sentence-transformers
2️⃣ Similarity Score
Not sure how to compare embeddings? Don't worry, you can now use model.similarity(embeddings1, embeddings2) and you'll get your similarity scores immediately. Model authors can specify their desired similarity score, so you don't have to worry about it anymore!
3️⃣ Additional Kwargs
Sentence Transformers relies on various Transformers instances (AutoModel, AutoTokenizer, AutoConfig), but it was hard to provide valuable keyword arguments to these (like 'torch_dtype=torch.bfloat16' to load a model in lower precision for a 2x inference speedup). This is now easy!
4️⃣ Hyperparameter Optimization
Sentence Transformers now ships with HPO, allowing you to effectively choose your hyperparameters for your data and task.
5️⃣ Dataset Release
To help you out with finetuning models, I've released 50+ ready-to-go datasets that can be used with training or finetuning embedding models: sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552
Full release notes: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.0.0
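As a rough illustration of the Similarity Score item (2️⃣) above, the following pure-Python sketch shows what a pairwise similarity call returns: one score per pair of rows from the two inputs. The function names and the registry of similarity functions are assumptions made for this example and do not mirror the library's internals.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dot_product(u, v):
    """Unnormalized dot product."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical registry: a model author would pick one of these once,
# so users do not have to choose a similarity function themselves.
SIMILARITY_FUNCTIONS = {"cosine": cosine, "dot": dot_product}


def similarity(embeddings1, embeddings2, name="cosine"):
    """Return a len(embeddings1) x len(embeddings2) matrix of scores."""
    fn = SIMILARITY_FUNCTIONS[name]
    return [[fn(u, v) for v in embeddings2] for u in embeddings1]


scores = similarity([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
print(scores)
```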
Very impressive! It seems excellent at Dutch, too
There are 3 models released:
- numind/NuNER_Zero:
The primary model, SOTA & can detect really long entities.
- numind/NuNER_Zero-span:
Slightly better performance than NuNER Zero, but can't detect entities longer than 12 tokens.
- numind/NuNER_Zero-4k:
Slightly worse than NuNER Zero, but has a context length of 4k tokens.
Some more details about these models in general:
- They are *really* small, orders of magnitude smaller than LLMs, which don't reach this level of performance.
- Because they're small - they're fast: <1s per sentence on free GPUs.
- They have an MIT license: free commercial usage.
Try out the demo here: https://huggingface.co./spaces/numind/NuZero
Or check out all of the models here: numind/nunerzero-zero-shot-ner-662b59803b9b438ff56e49e2
If there's ever a need for me to extract some information from any text: I'll be using these. Great work @Serega6678 !
- French Embedding Models: https://huggingface.co./collections/antoinelouis/dense-single-vector-bi-encoders-651523c0c75a3d4c44fc864d
- French Reranker Models: antoinelouis/cross-encoder-rerankers-651523f16efa656d1788a239
- French Multi-vector Models: https://huggingface.co./collections/antoinelouis/dense-multi-vector-bi-encoders-6589a8ee6b17c06872e9f075
- Multilingual Models: https://huggingface.co./collections/antoinelouis/modular-retrievers-65d53d0db64b1d644aea620c
A lot of these models use the MS MARCO Hard Negatives dataset, which I'm currently reformatting to be more easily usable. Notably, they should work out of the box without any pre-processing for training embedding models in the upcoming Sentence Transformers v3.
Oooh, Dataset.take should be very convenient. No more .select(range(...)) 🚀
I'm concerned about the low training speed (10x slower). Do we know anything about the inference latency as well? I think that's key to figure out whether this is viable or not.
Thanks for writing out this list! I try my best to keep up, but even I missed some of these
I quite enjoy the speed of these, well done.
Nice job! What are your findings so far? Can you reasonably handle the lengths that they claim?
1️⃣ A new loss function: CachedGISTEmbedLoss
This loss function is a combination of CachedMultipleNegativesRankingLoss and the GISTEmbedLoss, both of which are already excellent. The caching mechanism allows for much higher batch sizes with constant memory usage, which boosts training performance. The GIST part introduces a guide model to guide the in-batch negative sample selection. This prevents false negatives, resulting in a stronger training signal.
2️⃣ Automatic Matryoshka model truncation
Matryoshka models produce embeddings that are still useful after truncation. However, this truncation always had to be done manually, until now! We've added a truncate_dim option to the Sentence Transformer constructor. This also allows truncation when using HuggingFaceEmbeddings from LlamaIndex or LangChain.
3️⃣ Additionally, you can now specify truncate_dim in evaluators to get the performance after truncation. (Hint: it's surprisingly good, even for models not trained with MatryoshkaLoss, and it can speed up e.g. clustering, retrieval, etc.)
4️⃣ CrossEncoder improvements
The CrossEncoder now supports 'push_to_hub' to upload trained reranker models to Hugging Face. Additionally, CrossEncoders now support trust_remote_code to load models with custom modelling code.
5️⃣ Inference on Intel Gaudi2
If you have an Intel Gaudi2 Accelerator, Sentence Transformers now uses it automatically for even faster inference. No changes are necessary to your code, the device is automatically detected!
Check out the release notes for all of the details: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.7.0
I'm very excited for the upcoming releases: I'm making great progress with a notable v3 refactor that should heavily improve the training process for embedding models!
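For readers wondering what the manual truncation mentioned in 2️⃣ looked like, here is a minimal sketch: keep the first dimensions of the embedding and, if you compare with cosine similarity or dot product, rescale to unit length. The helper name and the toy vector are assumptions, and the re-normalization step is a choice made for this example rather than something the post specifies.

```python
import math


def truncate_and_normalize(embedding, truncate_dim):
    """Keep the first `truncate_dim` values and rescale to unit length."""
    truncated = embedding[:truncate_dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        return truncated  # nothing to rescale
    return [x / norm for x in truncated]


# Toy 4-dimensional embedding; real Matryoshka embeddings are far larger.
full = [0.6, 0.8, 0.1, -0.2]
small = truncate_and_normalize(full, truncate_dim=2)
print(len(full), "->", len(small))
```

Shorter vectors take less storage and are cheaper to compare, which is where the speedups for clustering and retrieval come from.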
Awesome! I reckon this'll make it a lot easier to quickly share, save & load some annotation work.
Very glad to see more uses of embedding quantization, great job.
The Recurrent Gemma is very intriguing to me. I'm looking forward to reading more about the RNN-based models when I have some more spare time.
Very exciting! I see you've already created a demo for it here: https://huggingface.co./spaces/urchade/gliner_multiv2.1
Looking forward to your blogpost! It's always exciting to see solid non-generative models.
float32 embeddings to binary or int8 embeddings. This saves 32x or 4x memory & disk space, and these embeddings are much easier to compare!
Our results show 25-45x speedups in retrieval compared to full-size embeddings, while keeping 96% of the performance!
Learn more about it in our blogpost in collaboration with mixedbread.ai: https://huggingface.co./blog/embedding-quantization
Or try out our demo where we use quantized embeddings to let you search all of Wikipedia (yes, 41,000,000 texts) in 1 second on a CPU Space: sentence-transformers/quantized-retrieval
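The 32x and 4x figures follow from the bit widths: a float32 value takes 32 bits, a binary value 1 bit, and an int8 value 8 bits. Below is a small pure-Python sketch of both ideas, with hypothetical function names and a fixed value range for the int8 case (in practice the range would be calibrated from data). It is not the library's quantization code.

```python
def quantize_binary(embedding):
    """1 bit per dimension: positive values become 1, the rest 0.
    The bits are packed into a single Python int."""
    bits = 0
    for value in embedding:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits; a cheap way to compare binary embeddings."""
    return bin(a ^ b).count("1")


def quantize_int8(embedding, min_val, max_val):
    """Map values from [min_val, max_val] linearly onto -128..127."""
    scale = (max_val - min_val) / 255
    return [
        max(-128, min(127, round((value - min_val) / scale) - 128))
        for value in embedding
    ]


doc_a = quantize_binary([0.3, -0.1, 0.8, -0.5])
doc_b = quantize_binary([0.2, -0.4, -0.6, 0.9])
print(bin(doc_a), bin(doc_b), hamming_distance(doc_a, doc_b))
print(quantize_int8([-1.0, 1.0], min_val=-1.0, max_val=1.0))
```

Comparing binary embeddings reduces to an XOR and a bit count, which is much cheaper than a floating-point dot product.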
Here's a few resources to get you started with them:
- All Sentence Transformer models: https://huggingface.co./models?library=sentence-transformers&sort=trending
- Sentence Transformer documentation: https://sbert.net/
- Massive Text Embedding Benchmark (MTEB) Leaderboard: mteb/leaderboard
The embedding space is extremely active right now, so if you're using an embedding model for your retrieval, semantic similarity, reranking, classification, clustering, etc., then be sure to keep an eye on the trending Sentence Transformer models & new models on MTEB.
Also, I'm curious if you've ever used Sentence Transformers via a third party library, like a RAG framework or vector database. I'm quite interested in more integrations to bring everyone free, efficient & powerful embedding models!
It seems that the Space has moved to: https://huggingface.co./spaces/DeepMount00/universal_ner_ita
And the model is now public: https://huggingface.co./DeepMount00/universal_ner_ita
Since then, a little-known research paper introduced GLiNER, which was a modified & finetuned variant of the microsoft/deberta-v3-base line of models. Notably, GLiNER outperforms UniNER-7B, despite being almost 2 orders of magnitude smaller! It also allows for multiple labels at once, supports nested NER, and the models are Apache 2.0.
Very recently, the models were uploaded to Hugging Face, and I was inspired to create a demo for the English model. The demo runs on CPU, and can still very efficiently compute labels with great performance. I'm very impressed by the models.
There are two models right now:
* base (english): urchade/gliner_base
* multi (multilingual): urchade/gliner_multi
And my demo to experiment with the base model can be found here: https://huggingface.co./spaces/tomaarsen/gliner_base
I made a demo for the base model! It works like a charm: https://huggingface.co./spaces/tomaarsen/gliner_base
I've had the same idea before as well! I think this should work as well, but I haven't had time to do the research myself. Perhaps @SeanLee97 is interested in trying this out?
1. Matryoshka Loss function - you can now train & perform inference on 🪆 Matryoshka Embedding models. See also our blogpost: https://huggingface.co./blog/matryoshka
2. CoSENTLoss & AnglELoss: State of the art loss functions. These are quite interesting, they outperform CosineSimilarityLoss on nearly all benchmarks as a drop-in replacement! See also the docs: https://sbert.net/docs/package_reference/losses.html#cosentloss
3. Prompt templates: Many popular models such as intfloat/multilingual-e5-large and BAAI/bge-large-en-v1.5 prefix their texts with prompts, so this adds configuration options to automatically include prompts using model.encode(..., prompt_name="query") which will include a prompt with the name "query". More info in the docs: https://sbert.net/examples/applications/computing-embeddings/README.html#prompt-templates
4. Instructor support: Support for the INSTRUCTOR line of models, such as hkunlp/instructor-large. Learn how to use them here: https://sbert.net/docs/pretrained_models.html#instructor-models
5. Removed NLTK & sentencepiece dependencies: Should allow for a smaller installation & a slightly faster import!
6. Updated documentation: a new Loss Overview section: https://sbert.net/docs/training/loss_overview.html and more detailed loss functions: https://sbert.net/docs/package_reference/losses.html
And much more! See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.4.0
Some more very exciting updates are still on the horizon!
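To illustrate the prompt templates from item 3 above: conceptually, a named prompt is a string that gets prepended to every text before encoding. The sketch below assumes a hypothetical PROMPTS mapping and apply_prompt helper, with "query: " and "passage: " as example prefixes. It only shows the string handling, not the library's encode method.

```python
# Hypothetical prompt configuration: a name mapped to a text prefix.
PROMPTS = {
    "query": "query: ",
    "passage": "passage: ",
}


def apply_prompt(texts, prompt_name=None, prompts=PROMPTS):
    """Prefix each text with the prompt registered under `prompt_name`.
    With no prompt name, the texts are returned unchanged."""
    if prompt_name is None:
        return list(texts)
    prefix = prompts[prompt_name]  # raises KeyError for unknown names
    return [prefix + text for text in texts]


prepared = apply_prompt(["what is a hard negative?"], prompt_name="query")
print(prepared)
```

Keeping the prompts in the model configuration means users only pass a name and cannot accidentally mistype the exact prefix a model was trained with.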
I've been working hard to get my HF Inbox down, but now my emails have started overflowing 🙃
Awesome! Very promising
I did not expect that many datasets to have such notable issues! Very interesting, thanks for sharing.
I would also be interested in the data quality bot that you describe at the end - I think that would be quite useful.
I've just uploaded v2.3.1 as well! It includes a niche bug fix for some local models. See more details here: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.3.1
Details:
⬆ Uploading Models to the Hub with save_to_hub.
⬇ Downloading Models from the Hub now downloads only necessary files.
⚙ Custom Models (such as jinaai/jina-embeddings-v2-base-de) can now be loaded with trust_remote_code=True.
🔍 Models can now be loaded at specific revisions (e.g. commit hashes or git branches).
🖥️ Various device fixes; models will now always operate on the device that you specify.
📉 A new "Cached" variant of the powerful Multiple Negatives Ranking Loss allows common hardware to reach performance previously only accessible on multi-gpu clusters.
🐎 Computation time of Community Detection was decreased significantly (7x speedup at 500k sentences 🤯)
🪶 Removed the now unnecessary "torchvision" dependency for a smaller installation.
Check out the full changelog here: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.3.0
I'll be working on many more changes in the near future, so expect more exciting updates. If you encounter any issues, or have any questions or feature requests, don't hesitate to open an issue on the repository: https://github.com/UKPLab/sentence-transformers/issues