Zhihui_LLM_Embedding
Model Introduction
Zhihui_LLM_Embedding is an embedding model specifically designed to enhance Chinese text retrieval capabilities. It is built on a 7B LLM with an enhanced bidirectional attention mechanism for improved contextual understanding, and it is trained on an extensive corpus from various fields with extremely large batch sizes. Zhihui_LLM_Embedding excels in retrieval tasks, ranking 1st on the C-MTEB leaderboard with a leading score of 76.74 as of June 25, 2024.
Optimization points
- Data source enhancement: Leverages the knowledge of LLMs (GPT-3.5 & GPT-4) through three types of distillation methods:
  - Data Refinement: The LLM scores candidate positive passages to select the most relevant examples.
  - Query Rewriting: The LLM generates queries that can be answered by the positive documents but are unrelated to the negatives, enhancing query quality and diversity.
  - Query Expansion: Queries are expanded across multiple topics for long documents.
- Negative example mining: Uses multiple methods and different selection ranges to mine hard negative examples.
- Improved Contrastive Loss: Designs a novel InfoNCE loss that assigns higher weights to harder negative examples, improving the model's fine-grained feature representation (see the sketch after this list).
- Bidirectional attention: Removes the causal attention mask of the decoder-only LLM during contrastive training to produce rich contextualized representations.
- Training efficiency: Uses Gradient Cache to scale contrastive-learning batches beyond GPU memory constraints, allowing the model to learn from more challenging negative examples.
- Others: Dataset-Homogeneous Batching, cross-batch negative sampling.
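Below is a minimal, illustrative sketch of what a hard-negative-weighted InfoNCE loss can look like. The exact weighting function, temperature, and negative-mining setup used to train Zhihui_LLM_Embedding are not specified in this card, so the softmax-based weighting and the temperature default below are assumptions for illustration only.

import torch

def weighted_info_nce(query, positive, negatives, temperature=0.05):
    # query, positive: (d,) L2-normalized embeddings; negatives: (n, d) L2-normalized hard negatives.
    pos_sim = torch.dot(query, positive) / temperature   # scalar similarity to the positive
    neg_sim = (negatives @ query) / temperature           # (n,) similarities to the negatives
    # Assumption: harder negatives (higher similarity to the query) receive larger weights;
    # softmax-normalized and rescaled so uniform negatives all get weight 1.
    weights = torch.softmax(neg_sim.detach(), dim=0) * neg_sim.numel()
    logits = torch.cat([pos_sim.view(1), neg_sim])         # (1 + n,)
    log_w = torch.cat([torch.zeros(1, device=query.device), weights.log()])
    # -log( exp(s_pos) / (exp(s_pos) + sum_i w_i * exp(s_neg_i)) ), computed stably via logsumexp.
    return -(pos_sim - torch.logsumexp(logits + log_w, dim=0))

In practice the same loss is computed over in-batch and cross-batch negatives as well, which is where the very large batches, Gradient Cache, and dataset-homogeneous batching mentioned above come in.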
Model Details
- Base Decoder-only LLM: gte-Qwen2-7B-instruct
- Pooling Methods: Last token
- Embedding Dimension: 3584 (see the quick check below)
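A quick way to confirm the pooled embedding dimension; this is a minimal check, assuming the sentence-transformers loading path shown in the Usage section below.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Lenovo-Zhihui/Zhihui_LLM_Embedding", trust_remote_code=True)
embedding = model.encode(["国家法定节假日共多少天"], normalize_embeddings=True)
print(embedding.shape)  # expected: (1, 3584), matching the embedding dimension listed above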
Usage
Requirements
transformers>=4.40.2
flash_attn>=2.5.8
sentence-transformers>=2.7.0
How to use
Here is an example of how to encode queries and passages using Hugging Face Transformers and Sentence Transformers.
Usage (HuggingFace Transformers)
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Pool each sequence by taking the hidden state of its last non-padding token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    # Queries carry a task instruction prefix; documents are encoded without one.
    return f'Instruct: {task_description}\nQuery: {query}'
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
get_detailed_instruct(task, "国家法定节假日共多少天"),
get_detailed_instruct(task, "如何查看好友申请")
]
documents = [
"一年国家法定节假日为11天。根据公布的国家法定节假日调整方案,调整的主要内容包括:元旦放假1天不变;春节放假3天,放假时间为农历正月初一、初二、初三;“五一”国际劳动节1天不变;“十一”国庆节放假3天;清明节、端午节、中秋节增设为国家法定节假日,各放假1天(农历节日如遇闰月,以第一个月为休假日)。3、允许周末上移下错,与法定节假日形成连休。",
"这个直接去我的QQ中心不就好了么那里可以查到 我的好友单向好友好友恢复、 以及好友申请 啊可以是你加别人的 或 别人加你的都可以查得到QQ空间里 这个没注意 要有的话也会在你进空间的时候会提示你的QQ 空间里 上面消息 就可以看见了!望采纳!谢谢这个直接去我的QQ中心不就好了么那里可以查到 我的好友单向好友好友恢复、 以及好友申请 啊可以是你加别人的 或 别人加你的都可以查得到",
]
input_texts = queries + documents
tokenizer = AutoTokenizer.from_pretrained('Lenovo-Zhihui/Zhihui_LLM_Embedding', trust_remote_code=True)
model = AutoModel.from_pretrained('Lenovo-Zhihui/Zhihui_LLM_Embedding', trust_remote_code=True)
max_length = 512
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
Usage (Sentence-Transformers)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Lenovo-Zhihui/Zhihui_LLM_Embedding", trust_remote_code=True)
model.max_seq_length = 512
# Data source: DuRetrieval https://huggingface.co./datasets/C-MTEB/DuRetrieval
queries = [
"国家法定节假日共多少天",
"如何查看好友申请",
]
documents = [
"一年国家法定节假日为11天。根据公布的国家法定节假日调整方案,调整的主要内容包括:元旦放假1天不变;春节放假3天,放假时间为农历正月初一、初二、初三;“五一”国际劳动节1天不变;“十一”国庆节放假3天;清明节、端午节、中秋节增设为国家法定节假日,各放假1天(农历节日如遇闰月,以第一个月为休假日)。3、允许周末上移下错,与法定节假日形成连休。",
"这个直接去我的QQ中心不就好了么那里可以查到 我的好友单向好友好友恢复、 以及好友申请 啊可以是你加别人的 或 别人加你的都可以查得到QQ空间里 这个没注意 要有的话也会在你进空间的时候会提示你的QQ 空间里 上面消息 就可以看见了!望采纳!谢谢这个直接去我的QQ中心不就好了么那里可以查到 我的好友单向好友好友恢复、 以及好友申请 啊可以是你加别人的 或 别人加你的都可以查得到",
]
query_embeddings = model.encode(queries, prompt_name="query", normalize_embeddings=True)
document_embeddings = model.encode(documents, normalize_embeddings=True)
scores = (query_embeddings @ document_embeddings.T)
print(scores.tolist())
Reproduce our results (C-MTEB)
Check out scripts/eval_mteb.py to reproduce the evaluation results on the C-MTEB benchmark.
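If scripts/eval_mteb.py is not at hand, a minimal sketch with the public mteb package looks roughly like this; it assumes your installed mteb version registers the C-MTEB tasks, and the official script may differ in details such as query instructions, batch size, and sequence length.

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Lenovo-Zhihui/Zhihui_LLM_Embedding", trust_remote_code=True)
model.max_seq_length = 512

# DuRetrieval is one of the C-MTEB retrieval tasks reported in the table below.
evaluation = MTEB(tasks=["DuRetrieval"])
evaluation.run(model, output_folder="results/Zhihui_LLM_Embedding")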
Model | T2Retrieval | MMarcoRetrieval | DuRetrieval | CovidRetrieval | CmedqaRetrieval | EcomRetrieval | MedicalRetrieval | VideoRetrieval | Avg |
---|---|---|---|---|---|---|---|---|---|
Zhihui_LLM_Embedding | 88.30 | 84.77 | 91.34 | 84.39 | 48.69 | 71.96 | 65.19 | 79.31 | 76.74 |
zpoint_large_embedding_zh | 83.81 | 82.38 | 89.23 | 89.14 | 47.16 | 70.74 | 68.14 | 80.26 | 76.36 |
gte-Qwen2-7B-instruct | 87.73 | 85.16 | 87.44 | 83.65 | 48.69 | 71.15 | 65.59 | 78.84 | 76.03 |
360Zhinao-search | 87.12 | 83.32 | 87.57 | 85.02 | 46.73 | 68.9 | 63.69 | 78.09 | 75.06 |
AGE_Hybrid | 86.88 | 80.65 | 89.28 | 83.66 | 47.26 | 69.28 | 65.94 | 76.79 | 74.97 |
Evaluation results
Self-reported results on MTEB CmedqaRetrieval:
- map_at_1: 29.012
- map_at_10: 41.998
- map_at_100: 43.821
- map_at_1000: 43.924
- map_at_3: 37.804
- map_at_5: 40.025
- mrr_at_1: 43.536
- mrr_at_10: 51.413
- mrr_at_100: 52.329
- mrr_at_1000: 52.366