gte-large-en-v1.5
We introduce the gte-v1.5 series, upgraded gte embeddings that support a context length of up to 8192 tokens while further improving model performance. The models are built on the transformer++ encoder backbone (BERT + RoPE + GLU).
The gte-v1.5 series achieves state-of-the-art scores on the MTEB benchmark within the same model size category and delivers competitive results on the LoCo long-context retrieval tests (refer to Evaluation).
We also present gte-Qwen1.5-7B-instruct, a SOTA instruction-tuned multilingual embedding model that ranked 2nd on MTEB and 1st on C-MTEB.
- Developed by: Institute for Intelligent Computing, Alibaba Group
- Model type: Text Embeddings
- Paper: mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval
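The transformer++ backbone mentioned above pairs a BERT-style encoder with rotary position embeddings (RoPE) and a gated (GLU) feed-forward layer. The snippet below is only a minimal, illustrative sketch of such a gated FFN block; the module and dimension names are ours and do not come from the gte implementation.

```python
# Illustrative sketch of a GLU-style feed-forward block, as used in
# transformer++-style encoders (not the gte source code).
import torch
import torch.nn as nn


class GatedFFN(nn.Module):
    def __init__(self, hidden_size: int = 1024, intermediate_size: int = 4096):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size)  # gate path
        self.up_proj = nn.Linear(hidden_size, intermediate_size)    # value path
        self.down_proj = nn.Linear(intermediate_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GLU: element-wise product of the activated gate and the value path
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))


x = torch.randn(2, 16, 1024)  # (batch, seq_len, hidden)
print(GatedFFN()(x).shape)    # torch.Size([2, 16, 1024])
```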
Model list
Models | Language | Model Size (M) | Max Seq. Length | Dimension | MTEB-en | LoCo |
---|---|---|---|---|---|---|
gte-Qwen1.5-7B-instruct | Multiple | 7720 | 32768 | 4096 | 67.34 | 87.57 |
gte-large-en-v1.5 | English | 434 | 8192 | 1024 | 65.39 | 86.71 |
gte-base-en-v1.5 | English | 137 | 8192 | 768 | 64.11 | 87.44 |
How to Get Started with the Model
Use the code below to get started with the model.
# Requires transformers>=4.36.0
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
input_texts = [
"what is the capital of China?",
"how to implement quick sort in python?",
"Beijing",
"sorting algorithms"
]
model_path = 'Alibaba-NLP/gte-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = outputs.last_hidden_state[:, 0]
# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
It is recommended to install xformers and enable unpadding for acceleration; refer to enable-unpadding-and-xformers for details.
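As a rough sketch of what enabling these options could look like: the keyword arguments `unpad_inputs` and `use_memory_efficient_attention` below are assumptions about the model's remote code, not something confirmed by this card, so check enable-unpadding-and-xformers for the supported configuration.

```python
# Hypothetical sketch: routing attention through xformers and skipping padding
# tokens via from_pretrained overrides. Requires a CUDA GPU and xformers installed.
# The two flags below are assumed names; verify against the remote modeling code.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = 'Alibaba-NLP/gte-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    unpad_inputs=True,                    # assumed flag: skip compute on padding tokens
    use_memory_efficient_attention=True,  # assumed flag: use xformers attention kernels
).cuda().eval()

batch_dict = tokenizer(["what is the capital of China?"], max_length=8192,
                       padding=True, truncation=True, return_tensors='pt').to('cuda')
with torch.inference_mode(), torch.autocast(device_type='cuda', dtype=torch.float16):
    embeddings = model(**batch_dict).last_hidden_state[:, 0]
```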
Use with sentence-transformers:
# Requires sentence_transformers>=2.7.0
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
sentences = ['That is a happy person', 'That is a very happy person']
model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
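For retrieval-style use with sentence-transformers, the same model can score a query against candidate documents. This is a small usage sketch reusing the texts from the earlier example; it relies only on `SentenceTransformer.encode` and `cos_sim` shown above.

```python
# Retrieval-style sketch: encode a query and candidate documents, then
# rank the candidates by cosine similarity to the query.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)

query = "what is the capital of China?"
documents = ["Beijing", "sorting algorithms"]

query_emb = model.encode(query)
doc_embs = model.encode(documents)
print(cos_sim(query_emb, doc_embs))  # higher score for the more relevant document
```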
Use with transformers.js:
// npm i @xenova/transformers
import { pipeline, dot } from '@xenova/transformers';
// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Alibaba-NLP/gte-large-en-v1.5', {
quantized: false, // Comment out this line to use the quantized version
});
// Generate sentence embeddings
const sentences = [
"what is the capital of China?",
"how to implement quick sort in python?",
"Beijing",
"sorting algorithms"
]
const output = await extractor(sentences, { normalize: true, pooling: 'cls' });
// Compute similarity scores
const [source_embeddings, ...document_embeddings ] = output.tolist();
const similarities = document_embeddings.map(x => 100 * dot(source_embeddings, x));
console.log(similarities); // [41.86354093370361, 77.07076371259589, 37.02981979677899]
Training Details
Training Data
- Masked language modeling (MLM):
c4-en
- Weak-supervised contrastive pre-training (CPT): GTE pre-training data
- Supervised contrastive fine-tuning: GTE fine-tuning data
Training Procedure
To enable the backbone model to support a context length of 8192, we adopt a multi-stage training strategy. The model first undergoes preliminary MLM pre-training on shorter sequence lengths; we then resample the data, reducing the proportion of short texts, and continue MLM pre-training at longer lengths.
The entire training process is as follows (a short sketch of the RoPE base adjustment follows the list):
- MLM-512: lr 2e-4, mlm_probability 0.3, batch_size 4096, num_steps 300000, rope_base 10000
- MLM-2048: lr 5e-5, mlm_probability 0.3, batch_size 4096, num_steps 30000, rope_base 10000
- MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 30000, rope_base 160000
- CPT: max_len 512, lr 5e-5, batch_size 28672, num_steps 100000
- Fine-tuning: TODO
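Raising rope_base from 10000 to 160000 in the MLM-8192 stage stretches the RoPE wavelengths so that positions up to 8192 remain well separated; the NTK scaling factor of 2 used at evaluation time (see Evaluation) applies the same idea on the fly by multiplying rope_base by 2. The following is only an illustrative sketch of the rotary inverse frequencies, not the model's actual code.

```python
# Illustrative sketch: rotary-embedding inverse frequencies for different bases.
# A larger base (or an NTK scaling factor, which multiplies the base) yields
# longer wavelengths, i.e. slower-rotating position signals for long contexts.
import torch

def rope_inv_freq(base: float, dim: int = 64) -> torch.Tensor:
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

for base in (10000.0, 160000.0, 160000.0 * 2):  # ntk factor 2 == rope_base * 2
    inv_freq = rope_inv_freq(base)
    # longest wavelength (in positions) of the slowest-rotating dimension
    max_wavelength = (2 * torch.pi / inv_freq[-1]).item()
    print(f"base={base:>9.0f}  max wavelength ~ {max_wavelength:.0f} positions")
```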
Evaluation
MTEB
The results of other models are retrieved from MTEB leaderboard.
The gte evaluation setting: mteb==1.2.0, fp16 automatic mixed precision, max_length=8192, and an NTK scaling factor of 2 (equivalent to rope_base * 2; see the sketch above).
Model Name | Param Size (M) | Dimension | Sequence Length | Average (56) | Class. (12) | Clust. (11) | Pair Class. (3) | Reran. (4) | Retr. (15) | STS (10) | Summ. (1) |
---|---|---|---|---|---|---|---|---|---|---|---|
gte-large-en-v1.5 | 409 | 1024 | 8192 | 65.39 | 77.75 | 47.95 | 84.63 | 58.50 | 57.91 | 81.43 | 30.91 |
mxbai-embed-large-v1 | 335 | 1024 | 512 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85 | 32.71 |
multilingual-e5-large-instruct | 560 | 1024 | 514 | 64.41 | 77.56 | 47.1 | 86.19 | 58.58 | 52.47 | 84.78 | 30.39 |
bge-large-en-v1.5 | 335 | 1024 | 512 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
gte-base-en-v1.5 | 137 | 768 | 8192 | 64.11 | 77.17 | 46.82 | 85.33 | 57.66 | 54.09 | 81.97 | 31.17 |
bge-base-en-v1.5 | 109 | 768 | 512 | 63.55 | 75.53 | 45.77 | 86.55 | 58.86 | 53.25 | 82.4 | 31.07 |
LoCo
Model Name | Dimension | Sequence Length | Average (5) | QMSumRetrieval | SummScreenRetrieval | QasperAbstractRetrieval | QasperTitleRetrieval | GovReportRetrieval |
---|---|---|---|---|---|---|---|---|
gte-qwen1.5-7b | 4096 | 32768 | 87.57 | 49.37 | 93.10 | 99.67 | 97.54 | 98.21 |
gte-large-v1.5 | 1024 | 8192 | 86.71 | 44.55 | 92.61 | 99.82 | 97.81 | 98.74 |
gte-base-v1.5 | 768 | 8192 | 87.44 | 49.91 | 91.78 | 99.82 | 97.13 | 98.58 |
Citation
If you find our paper or models helpful, please consider citing them as follows:
@article{zhang2024mgte,
title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
journal={arXiv preprint arXiv:2407.19669},
year={2024}
}
@article{li2023towards,
title={Towards general text embeddings with multi-stage contrastive learning},
author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
journal={arXiv preprint arXiv:2308.03281},
year={2023}
}