|
--- |
|
pipeline_tag: text-classification |
|
datasets: |
|
- ms_marco |
|
- sentence-transformers/msmarco-hard-negatives |
|
metrics: |
|
- recall |
|
tags: |
|
- passage-reranking |
|
library_name: sentence-transformers |
|
base_model: facebook/xmod-base |
|
inference: false |
|
language: |
|
- multilingual |
|
- af |
|
- am |
|
- ar |
|
- az |
|
- be |
|
- bg |
|
- bn |
|
- ca |
|
- cs |
|
- cy |
|
- da |
|
- de |
|
- el |
|
- en |
|
- eo |
|
- es |
|
- et |
|
- eu |
|
- fa |
|
- fi |
|
- fr |
|
- ga |
|
- gl |
|
- gu |
|
- ha |
|
- he |
|
- hi |
|
- hr |
|
- hu |
|
- hy |
|
- id |
|
- is |
|
- it |
|
- ja |
|
- ka |
|
- kk |
|
- km |
|
- kn |
|
- ko |
|
- ku |
|
- ky |
|
- la |
|
- lo |
|
- lt |
|
- lv |
|
- mk |
|
- ml |
|
- mn |
|
- mr |
|
- ms |
|
- my |
|
- ne |
|
- nl |
|
- no |
|
- or |
|
- pa |
|
- pl |
|
- ps |
|
- pt |
|
- ro |
|
- ru |
|
- sa |
|
- si |
|
- sk |
|
- sl |
|
- so |
|
- sq |
|
- sr |
|
- sv |
|
- sw |
|
- ta |
|
- te |
|
- th |
|
- tl |
|
- tr |
|
- uk |
|
- ur |
|
- uz |
|
- vi |
|
- zh |
|
--- |
|
|
|
<h1 align="center">Mono-XM</h1> |
|
|
|
|
|
<h4 align="center"> |
|
<p> |
|
<a href="#usage">🛠️ Usage</a> |
|
<a href="#evaluation">📊 Evaluation</a> | |
|
<a href="#train">🤖 Training</a> | |
|
<a href="#citation">🔗 Citation</a> | |
|
<a href="https://github.com/ant-louis/xm-retrievers">💻 Code</a> |
|
</p>
|
</h4> |
|
|
|
|
|
This is a **multilingual** reranking model. It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1. The model should be used as a reranker for semantic search: given a query, encode it together with each candidate passage -- e.g., passages retrieved with BM25 or a bi-encoder -- then sort the passages in decreasing order of relevance according to the model's predictions. The model uses an [XMOD](https://huggingface.co./facebook/xmod-base) backbone, which allows it to learn from monolingual fine-tuning in a high-resource language such as English and to perform zero-shot transfer to other languages.
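
Below is a minimal end-to-end sketch of this retrieve-then-rerank setup. The toy corpus, the `rank_bm25` first stage, and the English adapter code are illustrative placeholders; any BM25 or bi-encoder retriever can provide the candidates.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

# Toy corpus; in practice this is your full passage collection.
corpus = [
    "Brussels is the capital of Belgium.",
    "The Eiffel Tower is located in Paris.",
    "Python is a popular programming language.",
]
query = "Where is the Eiffel Tower?"

# First stage: retrieve candidate passages (here with BM25).
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
candidates = bm25.get_top_n(query.lower().split(), corpus, n=2)

# Second stage: rerank the candidates with Mono-XM.
reranker = CrossEncoder('antoinelouis/mono-xm')
reranker.model.set_default_language('en_XX')  # Activate the English adapters
scores = reranker.predict([(query, passage) for passage in candidates])

# Sort the candidates in decreasing order of relevance.
for passage, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}\t{passage}")
```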
|
|
|
## Usage |
|
|
|
Here are some examples for using the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [Hugging Face Transformers](#using-transformers).
|
|
|
#### Using Sentence-Transformers |
|
|
|
Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this: |
|
|
|
```python |
|
from sentence_transformers import CrossEncoder |
|
|
|
pairs = [ |
|
('Première question', 'Ceci est un paragraphe pertinent.'), |
|
('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'), |
|
] |
|
language_code = "fr_FR" #Find all codes here: https://huggingface.co./facebook/xmod-base#languages |
|
|
|
model = CrossEncoder('antoinelouis/mono-xm') |
|
model.model.set_default_language(language_code)  # Activate the language-specific adapters
|
|
|
scores = model.predict(pairs) |
|
print(scores) |
|
``` |
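
Recent versions of Sentence-Transformers also expose a convenience method, `CrossEncoder.rank`, which scores a single query against a list of candidate passages and returns them sorted by decreasing relevance. Reusing the `model` loaded above:

```python
results = model.rank('Première question', [
    'Ceci est un paragraphe pertinent.',
    'Et voilà un paragraphe non pertinent.',
])
for hit in results:
    print(hit['corpus_id'], hit['score'])
```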
|
|
|
#### Using FlagEmbedding |
|
|
|
Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this: |
|
|
|
```python |
|
from FlagEmbedding import FlagReranker |
|
|
|
pairs = [ |
|
('Première question', 'Ceci est un paragraphe pertinent.'), |
|
('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'), |
|
] |
|
language_code = "fr_FR" #Find all codes here: https://huggingface.co./facebook/xmod-base#languages |
|
|
|
model = FlagReranker('antoinelouis/mono-xm') |
|
model.model.set_default_language(language_code)  # Activate the language-specific adapters
|
|
|
scores = model.compute_score(pairs) |
|
print(scores) |
|
``` |
|
|
|
#### Using Transformers |
|
|
|
Start by installing the [library](https://huggingface.co./docs/transformers): `pip install -U transformers`. Then, you can use the model like this: |
|
|
|
```python |
|
import torch |
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
|
|
pairs = [ |
|
('Première question', 'Ceci est un paragraphe pertinent.'), |
|
('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'), |
|
] |
|
language_code = "fr_FR" #Find all codes here: https://huggingface.co./facebook/xmod-base#languages |
|
|
|
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/mono-xm') |
|
model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/mono-xm') |
|
model.set_default_language(language_code)  # Activate the language-specific adapters
|
|
|
features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt') |
|
with torch.no_grad(): |
|
scores = model(**features).logits |
|
print(scores) |
|
``` |
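
Note that `model(**features).logits` returns unnormalized logits. To map them to relevance scores in [0, 1] (which is what the Sentence-Transformers `CrossEncoder` above does by default for single-label models), apply a sigmoid:

```python
probs = torch.sigmoid(scores).squeeze(-1)
print(probs)
```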
|
|
|
*** |
|
|
|
## Evaluation |
|
|
|
[to come...] |
|
|
|
*** |
|
|
|
## Training |
|
|
|
#### Data |
|
|
|
We use the English training samples from the [MS MARCO passage ranking](https://ir-datasets.com/msmarco-passage.html#msmarco-passage/train) dataset, which contains 8.8M passages and 539K training queries. We use the BM25 negatives provided by the official dataset and sample 1M (q, p) pairs, one quarter of which are positives (i.e., 250k query-positive pairs and 750k query-negative pairs).
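
The exact preprocessing script is not included here, but the sampling scheme can be illustrated with a toy sketch. The queries and passages below are hypothetical stand-ins for the real MS MARCO data:

```python
import random

random.seed(42)

# Toy stand-ins for MS MARCO: qid -> query text, relevant passages, and BM25 negatives.
queries   = {"q1": "what is the capital of belgium", "q2": "who wrote hamlet"}
positives = {"q1": ["Brussels is the capital of Belgium."],
             "q2": ["Hamlet was written by William Shakespeare."]}
negatives = {"q1": ["The Loire is a river in France.", "Python is a snake."],
             "q2": ["The Eiffel Tower is in Paris.", "Water boils at 100 degrees Celsius."]}

NEGATIVES_PER_POSITIVE = 3  # yields the 250k positive / 750k negative split described above

pairs = []
for qid, query in queries.items():
    for pos in positives[qid]:
        pairs.append((query, pos, 1))
        # Sample BM25 negatives for the same query (with replacement here,
        # only because this toy set is so small).
        for neg in random.choices(negatives[qid], k=NEGATIVES_PER_POSITIVE):
            pairs.append((query, neg, 0))

random.shuffle(pairs)
print(pairs[:2])
```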
|
|
|
#### Implementation |
|
|
|
The model is initialized from the [xmod-base](https://huggingface.co./facebook/xmod-base) checkpoint and optimized via a binary cross-entropy loss (as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 5 epochs using the AdamW optimizer with a batch size of 32 and a peak learning rate of 2e-5, warmed up over the first 10% of training steps and then decayed linearly. The maximum sequence length for the concatenated question-passage pairs is set to 512 tokens.
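
As a rough sketch (not the exact training script), these hyper-parameters map directly onto the classic Sentence-Transformers `CrossEncoder.fit` interface, which defaults to a binary cross-entropy loss when `num_labels=1`. The toy `pairs` list below stands in for the 1M sampled MS MARCO pairs:

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Toy (query, passage, label) triples; in practice, use the 1M sampled MS MARCO pairs.
pairs = [
    ("what is the capital of belgium", "Brussels is the capital of Belgium.", 1),
    ("what is the capital of belgium", "The Eiffel Tower is in Paris.", 0),
]
train_samples = [InputExample(texts=[q, p], label=float(label)) for q, p, label in pairs]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=32)

# num_labels=1 makes the CrossEncoder optimize a binary cross-entropy loss by default.
model = CrossEncoder('facebook/xmod-base', num_labels=1, max_length=512)
model.model.set_default_language('en_XX')  # Fine-tune with the English adapters active

num_train_steps = len(train_dataloader) * 5
model.fit(
    train_dataloader=train_dataloader,
    epochs=5,
    optimizer_params={'lr': 2e-5},
    warmup_steps=int(0.1 * num_train_steps),
    output_path='output/mono-xm',
)
```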
|
|
|
*** |
|
|
|
## Citation |
|
|
|
```bibtex |
|
@article{louis2024modular, |
|
author = {Louis, Antoine and Saxena, Vageesh and van Dijck, Gijs and Spanakis, Gerasimos}, |
|
title = {ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval}, |
|
journal = {CoRR}, |
|
volume = {abs/2402.15059}, |
|
year = {2024}, |
|
url = {https://arxiv.org/abs/2402.15059}, |
|
doi = {10.48550/arXiv.2402.15059}, |
|
eprinttype = {arXiv}, |
|
eprint = {2402.15059}, |
|
} |
|
``` |