KaLM-embedding
KaLM-Embedding is a series of embedding models adapted from auto-regressive LLMs with superior training data.
KaLM-embedding-multilingual-mini is initialized from Qwen/Qwen2-0.5B and trained on large-scale weakly supervised pre-training data, followed by supervised fine-tuning data.
| Model Name | Model Size | C-MTEB(35) | MTEB(56) | Avg |
|---|---|---|---|---|
| multilingual-e5-large | 560M | 58.81 | 61.50 | 60.16 |
| bge-m3 (dense) | 560M | 60.80 | 59.84 | 60.32 |
| gte-multilingual-base (dense) | 305M | 62.72 | 61.40 | 62.06 |
| KaLM-embedding-multilingual-mini-v1 | 494M | 62.31 | 61.87 | 62.09 |
| KaLM-embedding-multilingual-mini-instruct-v1 | 494M | 63.57 | 64.74 | 64.16 |
| KaLM-embedding-multilingual-mini-instruct-v1.5 | 494M | 64.13 | 64.94 | 64.53 |
Because the model is built on Qwen2, we recommend installing transformers>=4.37.0; otherwise you may encounter the following error:
KeyError: 'qwen2'
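To catch this before loading the model, you can check the installed version up front. The following is a minimal sketch using only the standard library; the helper names (`release_tuple`, `meets_minimum`, `check_transformers`) are illustrative and not part of any library, and pre-release suffixes such as `.dev0` are ignored for simplicity.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MINIMUM_TRANSFORMERS = "4.37.0"  # minimum version recommended in this README


def release_tuple(version_string):
    """Return the leading numeric components, e.g. '4.40.0.dev0' -> (4, 40, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum=MINIMUM_TRANSFORMERS):
    """Compare release tuples, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_transformers():
    """Return a human-readable status message for the installed transformers."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if meets_minimum(installed):
        return f"transformers {installed} is new enough"
    return f"transformers {installed} is too old; upgrade to >= {MINIMUM_TRANSFORMERS}"


if __name__ == "__main__":
    print(check_transformers())
```

If the check reports an older version, upgrading with `pip install -U transformers` should resolve the `KeyError`.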
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('{MODEL_NAME_OR_PATH}') # Do NOT set trust_remote_code
model.max_seq_length = 512
embeddings = model.encode(
    sentences,
    normalize_embeddings=True,
    batch_size=256,
    show_progress_bar=True
)
print(embeddings)
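Because the embeddings are L2-normalized (`normalize_embeddings=True`), the dot product of two embeddings equals their cosine similarity. The sketch below illustrates ranking by that score; it uses small hand-written vectors as stand-ins for real model output, and the helper names (`l2_normalize`, `dot`, `rank_by_similarity`) are illustrative only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (what normalize_embeddings=True does)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("Cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, dot(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings.
query = l2_normalize([1.0, 0.0, 0.0])
candidates = [
    l2_normalize(v)
    for v in ([0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0])
]
ranking = rank_by_similarity(query, candidates)
print(ranking)
```

With real embeddings, the same functions accept the rows of the array returned by `model.encode`, although a vectorized matrix product is preferable for large corpora.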
We add instructions for classification and clustering tasks. To prepend an instruction to the queries (the corpus is encoded without an instruction), use the model like this:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('{MODEL_NAME_OR_PATH}') # Do NOT set trust_remote_code
model.max_seq_length = 512
prompt = "Instruct: Classifying the category of french news. \n Query: "
embeddings = model.encode(
    sentences,
    prompt=prompt,
    normalize_embeddings=True,
    batch_size=256,
    show_progress_bar=True
)
print(embeddings)
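The prompt in the example follows the pattern `Instruct: <task description> \n Query: `. A small helper can keep that format consistent across tasks. This is a sketch only: `build_prompt` is a hypothetical name, and the second task description is an invented example, not one taken from the model's training instructions.

```python
def build_prompt(task_description):
    """Wrap a task description in the instruction format used in the example above."""
    return f"Instruct: {task_description} \n Query: "


# Reproduces the prompt from the example above.
news_prompt = build_prompt("Classifying the category of french news.")

# Hypothetical task description, for illustration only.
topic_prompt = build_prompt("Identify the topic of the given text.")

print(repr(news_prompt))
print(repr(topic_prompt))
```

The resulting string is passed as `prompt=` when encoding queries; corpus texts are encoded without it.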
If you encounter any issues, feel free to contact us by email: [email protected]