---
pipeline_tag: text-classification
datasets:
- ms_marco
- sentence-transformers/msmarco-hard-negatives
metrics:
- recall
tags:
- passage-reranking
library_name: sentence-transformers
base_model: facebook/xmod-base
inference: false
language:
- multilingual
- af
- am
- ar
- az
- be
- bg
- bn
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- ga
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- si
- sk
- sl
- so
- sq
- sr
- sv
- sw
- ta
- te
- th
- tl
- tr
- uk
- ur
- uz
- vi
- zh
---
<h1 align="center">Mono-XM</h1>
<h4 align="center">
<p>
<a href="#usage">🛠️ Usage</a> |
<a href="#evaluation">📊 Evaluation</a> |
<a href="#training">🤖 Training</a> |
<a href="#citation">🔗 Citation</a> |
<a href="https://github.com/ant-louis/xm-retrievers">💻 Code</a>
</p>
</h4>
This is a [sentence-transformers](https://www.sbert.net/examples/applications/cross-encoder/README.html) cross-encoder model. It performs cross-attention over a question-passage
pair and outputs a relevance score between 0 and 1. The model should be used as a reranker for semantic search: given a query, score it against a set of candidate
passages -- e.g., retrieved with BM25 or a bi-encoder -- then sort the passages in decreasing order of relevance according to the model's predictions.
The model uses an [XMOD](https://huggingface.co./facebook/xmod-base) backbone, which allows it to learn from monolingual fine-tuning
in a high-resource language such as English and to transfer zero-shot to other languages.
## Usage
Here are some examples of how to use the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [HuggingFace Transformers](#using-transformers).
#### Using Sentence-Transformers
Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:
```python
from sentence_transformers import CrossEncoder
pairs = [
('Première question', 'Ceci est un paragraphe pertinent.'),
('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
]
language_code = "fr_FR" #Find all codes here: https://huggingface.co./facebook/xmod-base#languages
model = CrossEncoder('antoinelouis/mono-xm')
model.model.set_default_language(language_code) #Activate the language-specific adapters
scores = model.predict(pairs)
print(scores)
```
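To rerank retrieval results, you can score a query against its candidate passages and sort them by predicted relevance, as in the sketch below. The French query and candidate passages are made-up placeholders standing in for the output of a first-stage retriever such as BM25 or a bi-encoder.

```python
from sentence_transformers import CrossEncoder

# Hypothetical French query and candidate passages (e.g., the top results from BM25)
query = "Quelle est la capitale de la Belgique ?"
candidates = [
    "Le chocolat belge est réputé dans le monde entier.",
    "Bruxelles est la capitale de la Belgique.",
    "Paris est la capitale de la France.",
]

model = CrossEncoder('antoinelouis/mono-xm')
model.model.set_default_language("fr_FR")  # Activate the French adapters

# Score each (query, passage) pair and sort the candidates by decreasing relevance
scores = model.predict([(query, passage) for passage in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for passage, score in reranked:
    print(f"{score:.3f}\t{passage}")
```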
#### Using FlagEmbedding
Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this:
```python
from FlagEmbedding import FlagReranker
pairs = [
('Première question', 'Ceci est un paragraphe pertinent.'),
('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
]
language_code = "fr_FR" #Find all codes here: https://huggingface.co./facebook/xmod-base#languages
model = FlagReranker('antoinelouis/mono-xm')
model.model.set_default_language(language_code) #Activate the language-specific adapters
scores = model.compute_score(pairs)
print(scores)
```
#### Using Transformers
Start by installing the [library](https://huggingface.co./docs/transformers): `pip install -U transformers`. Then, you can use the model like this:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

pairs = [
    ('Première question', 'Ceci est un paragraphe pertinent.'),
    ('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
]
language_code = "fr_FR"  # Find all codes here: https://huggingface.co./facebook/xmod-base#languages

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/mono-xm')
model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/mono-xm')
model.set_default_language(language_code)  # Activate the language-specific adapters
model.eval()

features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    logits = model(**features).logits
    scores = torch.sigmoid(logits)  # map the raw logits to relevance scores between 0 and 1
print(scores)
```
***
## Evaluation
[to come...]
***
## Training
#### Data
We use the English training samples from the [MS MARCO passage ranking](https://ir-datasets.com/msmarco-passage.html#msmarco-passage/train) dataset, which contains
8.8M passages and 539K training queries. We use the BM25 negatives provided with the official dataset and sample 1M (query, passage) pairs, of which one in four is a
positive (i.e., 250K query-positive pairs and 750K query-negative pairs).
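As an illustration of that sampling scheme (not the actual data preparation script), the sketch below builds labelled pairs from (query, positive, BM25-negative) triples; the `triples` list is a tiny placeholder for the real MS MARCO training triples.

```python
import random

random.seed(42)

# Placeholder for the MS MARCO (query, positive_passage, bm25_negative_passage) triples
triples = [
    ("what is the capital of belgium", "Brussels is the capital of Belgium.", "Belgium borders France."),
    ("who wrote hamlet", "Hamlet was written by William Shakespeare.", "Hamlet is a Shakespeare character."),
]

positives = [(query, pos, 1.0) for query, pos, _ in triples]  # label 1 = relevant
negatives = [(query, neg, 0.0) for query, _, neg in triples]  # label 0 = not relevant

# 1M pairs in total: 250K positives and 750K negatives (one positive for every three negatives).
# random.choices samples with replacement, which only matters for this tiny placeholder list.
train_pairs = random.choices(positives, k=250_000) + random.choices(negatives, k=750_000)
random.shuffle(train_pairs)
```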
#### Implementation
The model is initialized from the [xmod-base](https://huggingface.co./facebook/xmod-base) checkpoint and optimized via the binary cross-entropy loss
(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 5 epochs using the AdamW optimizer with
a batch size of 32 and a peak learning rate of 2e-5, warmed up over the first 10% of training steps and then decayed linearly. The maximum sequence
length for the concatenated question-passage pairs is set to 512 tokens.
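For reference, here is a minimal sketch of a comparable fine-tuning setup with the sentence-transformers `CrossEncoder` API. It mirrors the hyperparameters above but is not the original training script (see the [code repository](https://github.com/ant-louis/xm-retrievers) for that); `train_pairs` stands in for the 1M labelled pairs described in the Data section.

```python
import math
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Placeholder for the 1M labelled (query, passage) pairs described above
train_pairs = [
    ("what is the capital of belgium", "Brussels is the capital of Belgium.", 1.0),
    ("what is the capital of belgium", "Belgium borders France.", 0.0),
]
train_samples = [InputExample(texts=[q, p], label=label) for q, p, label in train_pairs]

# A single output logit makes CrossEncoder.fit() default to a binary cross-entropy loss
model = CrossEncoder('facebook/xmod-base', num_labels=1, max_length=512)
model.model.set_default_language("en_XX")  # fine-tune with the English adapters

epochs = 5
batch_size = 32
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=batch_size)
warmup_steps = math.ceil(len(train_dataloader) * epochs * 0.1)  # warm up over the first 10% of steps

model.fit(
    train_dataloader=train_dataloader,
    epochs=epochs,
    warmup_steps=warmup_steps,      # 'WarmupLinear' scheduling (linear warmup, then linear decay) is the default
    optimizer_params={'lr': 2e-5},  # AdamW is the default optimizer
)
model.save('mono-xm')
```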
***
## Citation
```bibtex
@article{louis2024modular,
author = {Louis, Antoine and Saxena, Vageesh and van Dijck, Gijs and Spanakis, Gerasimos},
title = {ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval},
journal = {CoRR},
volume = {abs/2402.15059},
year = {2024},
url = {https://arxiv.org/abs/2402.15059},
doi = {10.48550/arXiv.2402.15059},
eprinttype = {arXiv},
eprint = {2402.15059},
}
```