|
--- |
|
pipeline_tag: text-classification |
|
datasets: |
|
- ms_marco |
|
- sentence-transformers/msmarco-hard-negatives |
|
metrics: |
|
- recall |
|
tags: |
|
- passage-reranking |
|
library_name: sentence-transformers |
|
base_model: facebook/xmod-base |
|
inference: false |
|
language: |
|
- multilingual |
|
- af |
|
- am |
|
- ar |
|
- az |
|
- be |
|
- bg |
|
- bn |
|
- ca |
|
- cs |
|
- cy |
|
- da |
|
- de |
|
- el |
|
- en |
|
- eo |
|
- es |
|
- et |
|
- eu |
|
- fa |
|
- fi |
|
- fr |
|
- ga |
|
- gl |
|
- gu |
|
- ha |
|
- he |
|
- hi |
|
- hr |
|
- hu |
|
- hy |
|
- id |
|
- is |
|
- it |
|
- ja |
|
- ka |
|
- kk |
|
- km |
|
- kn |
|
- ko |
|
- ku |
|
- ky |
|
- la |
|
- lo |
|
- lt |
|
- lv |
|
- mk |
|
- ml |
|
- mn |
|
- mr |
|
- ms |
|
- my |
|
- ne |
|
- nl |
|
- no |
|
- or |
|
- pa |
|
- pl |
|
- ps |
|
- pt |
|
- ro |
|
- ru |
|
- sa |
|
- si |
|
- sk |
|
- sl |
|
- so |
|
- sq |
|
- sr |
|
- sv |
|
- sw |
|
- ta |
|
- te |
|
- th |
|
- tl |
|
- tr |
|
- uk |
|
- ur |
|
- uz |
|
- vi |
|
- zh |
|
--- |
|
|
|
<h1 align="center">Mono-XM</h1> |
|
|
|
|
|
<h4 align="center"> |
|
<p> |
|
<a href="#usage">🛠️ Usage</a> |
|
<a href="#evaluation">📊 Evaluation</a> | |
|
<a href="#train">🤖 Training</a> | |
|
<a href="#citation">🔗 Citation</a> | |
|
<a href="https://github.com/ant-louis/xm-retrievers">💻 Code</a> |
|
</p>
|
</h4> |
|
|
|
|
|
This is a **multilingual** reranking model. It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1. The model should be used as a reranker for semantic search: given a query, encode it together with each candidate passage -- e.g., passages retrieved with BM25 or a bi-encoder -- then sort the passages in decreasing order of relevance according to the model's predictions. The model uses an [XMOD](https://huggingface.co./facebook/xmod-base) backbone, which allows it to learn from monolingual fine-tuning in a high-resource language such as English and to perform zero-shot transfer to other languages.
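
Below is a minimal end-to-end sketch of this retrieve-then-rerank setup. The toy corpus, the `rank_bm25` first stage, and the English adapter code are illustrative placeholders; any BM25 or bi-encoder retriever can provide the candidates.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

# Toy corpus; in practice this is your full passage collection.
corpus = [
    "Brussels is the capital of Belgium.",
    "The Eiffel Tower is located in Paris.",
    "Python is a popular programming language.",
]
query = "Where is the Eiffel Tower?"

# First stage: retrieve candidate passages (here with BM25).
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
candidates = bm25.get_top_n(query.lower().split(), corpus, n=2)

# Second stage: rerank the candidates with Mono-XM.
reranker = CrossEncoder('antoinelouis/mono-xm')
reranker.model.set_default_language('en_XX')  # Activate the English adapters
scores = reranker.predict([(query, passage) for passage in candidates])

# Sort the candidates in decreasing order of relevance.
for passage, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}\t{passage}")
```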
|
|
|
## Usage |
|
|
|
Here are some examples for using the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [Hugging Face Transformers](#using-transformers).
|
|
|
#### Using Sentence-Transformers |
|
|
|
Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this: |
|
|
|
```python |
|
from sentence_transformers import CrossEncoder |
|
|
|
pairs = [ |
|
('Première question', 'Ceci est un paragraphe pertinent.'), |
|
('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'), |
|
] |
|
language_code = "fr_FR" #Find all codes here: https://huggingface.co./facebook/xmod-base#languages |
|
|
|
model = CrossEncoder('antoinelouis/mono-xm') |
|
model.model.set_default_language(language_code)  # Activate the language-specific adapters
|
|
|
scores = model.predict(pairs) |
|
print(scores) |
|
``` |
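
Recent versions of Sentence-Transformers also expose a convenience method, `CrossEncoder.rank`, which scores a single query against a list of candidate passages and returns them sorted by decreasing relevance. Reusing the `model` loaded above:

```python
results = model.rank('Première question', [
    'Ceci est un paragraphe pertinent.',
    'Et voilà un paragraphe non pertinent.',
])
for hit in results:
    print(hit['corpus_id'], hit['score'])
```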
|
|
|
#### Using FlagEmbedding |
|
|
|
Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this: |
|
|
|
```python |
|
from FlagEmbedding import FlagReranker |
|
|
|
pairs = [ |
|
('Première question', 'Ceci est un paragraphe pertinent.'), |
|
('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'), |
|
] |
|
language_code = "fr_FR" #Find all codes here: https://huggingface.co./facebook/xmod-base#languages |
|
|
|
model = FlagReranker('antoinelouis/mono-xm') |
|
model.model.set_default_language(language_code)  # Activate the language-specific adapters
|
|
|
scores = model.compute_score(pairs) |
|
print(scores) |
|
``` |
|
|
|
#### Using Transformers |
|
|
|
Start by installing the [library](https://huggingface.co./docs/transformers): `pip install -U transformers`. Then, you can use the model like this: |
|
|
|
```python |
|
import torch |
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
|
|
pairs = [ |
|
('Première question', 'Ceci est un paragraphe pertinent.'), |
|
('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'), |
|
] |
|
language_code = "fr_FR" #Find all codes here: https://huggingface.co./facebook/xmod-base#languages |
|
|
|
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/mono-xm') |
|
model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/mono-xm') |
|
model.set_default_language(language_code)  # Activate the language-specific adapters
|
|
|
features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt') |
|
with torch.no_grad(): |
|
scores = model(**features).logits |
|
print(scores) |
|
``` |
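
Note that `model(**features).logits` returns unnormalized logits. To map them to relevance scores in [0, 1] (which is what the Sentence-Transformers `CrossEncoder` above does by default for single-label models), apply a sigmoid:

```python
probs = torch.sigmoid(scores).squeeze(-1)
print(probs)
```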
|
|
|
*** |
|
|
|
## Evaluation |
|
|
|
[to come...] |
|
|
|
*** |
|
|
|
## Training |
|
|
|
#### Data |
|
|
|
We use the English training samples from the [MS MARCO passage ranking](https://ir-datasets.com/msmarco-passage.html#msmarco-passage/train) dataset, which contains 8.8M passages and 539K training queries. We use the BM25 negatives provided by the official dataset and sample 1M (q, p) pairs, one quarter of which are positives (i.e., 250k query-positive pairs and 750k query-negative pairs).
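
The exact preprocessing script is not included here, but the sampling scheme can be illustrated with a toy sketch. The queries and passages below are hypothetical stand-ins for the real MS MARCO data:

```python
import random

random.seed(42)

# Toy stand-ins for MS MARCO: qid -> query text, relevant passages, and BM25 negatives.
queries   = {"q1": "what is the capital of belgium", "q2": "who wrote hamlet"}
positives = {"q1": ["Brussels is the capital of Belgium."],
             "q2": ["Hamlet was written by William Shakespeare."]}
negatives = {"q1": ["The Loire is a river in France.", "Python is a snake."],
             "q2": ["The Eiffel Tower is in Paris.", "Water boils at 100 degrees Celsius."]}

NEGATIVES_PER_POSITIVE = 3  # yields the 250k positive / 750k negative split described above

pairs = []
for qid, query in queries.items():
    for pos in positives[qid]:
        pairs.append((query, pos, 1))
        # Sample BM25 negatives for the same query (with replacement here,
        # only because this toy set is so small).
        for neg in random.choices(negatives[qid], k=NEGATIVES_PER_POSITIVE):
            pairs.append((query, neg, 0))

random.shuffle(pairs)
print(pairs[:2])
```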
|
|
|
#### Implementation |
|
|
|
The model is initialized from the [xmod-base](https://huggingface.co./facebook/xmod-base) checkpoint and optimized via a binary cross-entropy loss (as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 5 epochs using the AdamW optimizer with a batch size of 32 and a peak learning rate of 2e-5, warmed up over the first 10% of training steps and then decayed linearly. The maximum sequence length for the concatenated question-passage pairs is set to 512 tokens.
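
As a rough sketch (not the exact training script), these hyper-parameters map directly onto the classic Sentence-Transformers `CrossEncoder.fit` interface, which defaults to a binary cross-entropy loss when `num_labels=1`. The toy `pairs` list below stands in for the 1M sampled MS MARCO pairs:

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Toy (query, passage, label) triples; in practice, use the 1M sampled MS MARCO pairs.
pairs = [
    ("what is the capital of belgium", "Brussels is the capital of Belgium.", 1),
    ("what is the capital of belgium", "The Eiffel Tower is in Paris.", 0),
]
train_samples = [InputExample(texts=[q, p], label=float(label)) for q, p, label in pairs]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=32)

# num_labels=1 makes the CrossEncoder optimize a binary cross-entropy loss by default.
model = CrossEncoder('facebook/xmod-base', num_labels=1, max_length=512)
model.model.set_default_language('en_XX')  # Fine-tune with the English adapters active

num_train_steps = len(train_dataloader) * 5
model.fit(
    train_dataloader=train_dataloader,
    epochs=5,
    optimizer_params={'lr': 2e-5},
    warmup_steps=int(0.1 * num_train_steps),
    output_path='output/mono-xm',
)
```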
|
|
|
*** |
|
|
|
## Citation |
|
|
|
```bibtex |
|
@article{louis2024modular, |
|
author = {Louis, Antoine and Saxena, Vageesh and van Dijck, Gijs and Spanakis, Gerasimos}, |
|
title = {ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval}, |
|
journal = {CoRR}, |
|
volume = {abs/2402.15059}, |
|
year = {2024}, |
|
url = {https://arxiv.org/abs/2402.15059}, |
|
doi = {10.48550/arXiv.2402.15059}, |
|
eprinttype = {arXiv}, |
|
eprint = {2402.15059}, |
|
} |
|
``` |