LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models

LUSIFER is framework for bridging the gap between multilingual understanding and task-specific text embeddings without relying on explicit multilingual supervision. It does this by combining a multilingual encoder (providing a universal language foundation) with an LLM-based embedding model (optimized for embedding tasks), connected through a minimal set of trainable parameters. LUSIFER also introduces two stages of training process: 1) Alignment Training and 2) Representation Fine-tuning to optimize the model for zero-shot multilingual embeddings.

Installation

To use LUSFIER, install evironment from environment.yaml (optional)

conda env create -f environment.yaml

After that, you can install our package from source by

git clone https://github.com/hieum98/lusifer.git
cd lusifer
pip install -e .

You also need to install the Flash-Attention before running the code because we use the Flash-Attention as the attention implementation in our model. You can install the Flash-Attention by running the following command:

pip install packaging
pip install ninja
pip install flash-attn --no-build-isolation

Getting started

LUSIFER provides a thorough set of tools for training, evaluating, and using the model. The following sections provide a brief overview of how to use the model for training, evaluation, and inference.

Preparing the model

LUSIFER model can be easily loaded using the from_pretrained method. The model can be loaded from the Hugging Face model hub by providing the model name or path to the model weights. The following code snippet demonstrates how to load the model from the Hugging Face model hub.

from lusifer.models.lusifer import Lusifer

model = Lusifer.from_pretrained("Hieuman/LUSIFER")

Inference

This model now returns the text embedding for any input in the form of str or List[str]. The model also can receive instruction alongside the sentence.

import torch
from lusifer.models.lusifer import Lusifer

model = Lusifer.from_pretrained("Hieuman/LUSIFER")

model = model.to("cuda")

# Encoding queries using instructions
instruction =  "Given a web search query, retrieve relevant passages that answer the query:"
queries = [
    "how much protein should a female eat",
    "summit define",
]
q_reps = model.encode(sentences=queries)

# Encoding documents. Instruction are not required for documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.",
]
d_reps = model.encode(sentences=documents)

# Compute cosine similarity
q_reps_norm = torch.nn.functional.normalize(torch.from_numpy(q_reps), p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(torch.from_numpy(d_reps), p=2, dim=1)
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))

print(cos_sim)

Training

Alignment Training

To train the model in the alignment stage, run the following command:

python -m src.main     \
    --config_file scripts/configs/aligment_training_reconstruction_and_completion.yaml     \
    --nodes 1     \
    --devices 4  

It will run the alignment training on 4 GPUs with both reconstruction and completion tasks with the configuration in the scripts/configs/aligment_training_reconstruction_and_completion.yaml file. For more details about the configuration file, please refer to the scripts/configs/aligment_training_reconstruction_and_completion.yaml file and the arguments in the lusifer/args.py file.

We also provide the configuration file for the alignment training with the reconstruction task only in the scripts/configs/alignment_training_reconstruction.yaml file. We suggest using the reconstruction task only first to stabilize the training process before adding the completion task.

Representation Fine-tuning

To train the model in the representation fine-tuning stage, run the following command:

python -m src.main     \
    --config_file scripts/configs/representation_fintuning_retrieval_data_only.yaml     \
    --nodes 1     \
    --devices 4  

We also provide the configuration file for the representation fine-tuning with both retrieval and non-retrieval data in the scripts/configs/representation_finetuning_all.yaml file. We suggest using the retrieval data only first to stabilize the training process before adding the non-retrieval data.

To be concise, we suggest the following training process: reconstruction task only -> reconstruction + completion task -> retrieval data only -> retrieval + non-retrieval data.

Evaluation

We propose a new benchmark for evaluating the model on the multilingual text embedding task. The benchmark includes 5 primary embedding tasks: Classification, Clustering, Reranking, Retrieval, and Semantic Textual Similarity (STS) across 123 diverse datasets spanning 14 languages

We support to evaluate model on various datasets by intergrating mteb library. To evaluate the model, run the following command:

python -m lusifer.eval.eval \
    --model_name_or_path Hieuman/LUSIFER \
    --is_lusifer \

Results

We provide the results of LUSIFER on the multilingual text embedding benchmark in the following table. The results are reported in terms of the average main metric across all tasks and datasets. Please refer the paper for the full results

Citation

If you use LUSIFER in your research, please cite the following paper:

@misc{man2025lusiferlanguageuniversalspace,
      title={LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models}, 
      author={Hieu Man and Nghia Trung Ngo and Viet Dac Lai and Ryan A. Rossi and Franck Dernoncourt and Thien Huu Nguyen},
      year={2025},
      eprint={2501.00874},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.00874}, 
}

Bugs or questions?

If you have any questions about the code, feel free to open an issue on the GitHub repository or send me an email at [email protected].

Downloads last month
5
Safetensors
Model size
7.9B params
Tensor type
F32
·
BF16
·
Inference API
Unable to determine this model's library. Check the docs .