EuroLLM-9B

Community Article Published December 2, 2024
drawing

We’re thrilled to unveil EuroLLM-9B—the most advanced language model of its size developed in Europe to date. Built using the cutting-edge EuroHPC infrastructure, EuroLLM-9B marks a major milestone in our mission to deliver state-of-the-art, multilingual language models tailored to European languages. In this post, we provide an overview of the model and highlight its benchmark performance.

Stay tuned for the upcoming technical report describing all the data and model development details, extra checkpoints, and the future release of even larger, more powerful models!

Pre-trained model: https://huggingface.co./utter-project/EuroLLM-9B
Post-trained model: https://huggingface.co./utter-project/EuroLLM-9B-Instruct

Introduction

While the quality of open-source large language models (LLMs) has been improving rapidly, most are English-centric or support only a limited set of languages, leaving many European languages underserved. To bridge this gap, we launched the EuroLLM project, with the aim of creating a suite of fully open LLMs capable of understanding and generating text across all the 24 official European Union (EU) languages, as well as 11 commercially and strategically important international languages.

Our journey began with the release of EuroLLM-1.7B (see Martins et al., 2024), a compact, efficient model that delivers strong performance in machine translation and ranks competitively in general benchmarks. Today, we are excited to release EuroLLM-9B, which ranks as the best open European-made LLM of its size.

Our work doesn’t stop here—we’re already developing a larger, more powerful model to expand the EuroLLM family.

Languages supported: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian.

Developed by: Unbabel, Instituto Superior Técnico, Instituto de Telecomunicações, University of Edinburgh, Aveni, University of Paris-Saclay, University of Amsterdam, Naver Labs, Sorbonne Université.

Authors: Pedro Henrique Martins, João Alves, Patrick Fernandes, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins

Results

We demonstrate the performance of EuroLLM-9B on multiple benchmarks including multilingual general benchmarks (using translations of English benchmarks), machine translation, and English general benchmarks.

EU Languages

image/png Table 1: Comparison of open-weight LLMs on multilingual benchmarks. The borda count corresponds to the average ranking of the models (see (Colombo et al., 2022)). For Arc-challenge, Hellaswag, and MMLU we are using Okapi datasets (Lai et al., 2023) which include 11 languages. For MMLU-Pro and MUSR we translate the English version with Tower (Alves et al., 2024) to 6 EU languages. For WMT24 and FLORES we average the Comet scores of 3 and 46 language pairs, respectively.
* As there are no public versions of the pre-trained models, we evaluated them using the post-trained versions.

The results in Table 1 highlight EuroLLM-9B's superior performance on multilingual tasks compared to other European-developed models (as shown by the Borda count of 1.0), as well as its strong competitiveness with non-European models, achieving results comparable to Gemma-2-9B and outperforming the rest on most benchmarks.

English

image/png

Table 2: Comparison of open-weight LLMs on English general benchmarks.
* As there are no public versions of the pre-trained models, we evaluated them using the post-trained versions.

The results in Table 2 demonstrate EuroLLM's strong performance on English tasks, surpassing most European-developed models and matching the performance of Mistral-7B (obtaining the same Borda count).

Tokenizer

For an LLM to be efficient across a large number of languages, the development of a suitable tokenizer is essential. Thus, we’ve trained a tokenizer with a vocabulary of 128,000 word pieces, focusing primarily on the EU official languages.

image/png Figure 1: Fertility (pieces / word) obtained with the Mistral, LLaMa-3, Gemma, and EuroLLM tokenizers for a subset of the EuroLLM languages. Lower is better.

Pre-training

EuroLLM-9B was trained on approximately 4 trillion tokens, using 400 Nvidia H100 GPUs on the MareNostrum5 supercomputer, thanks to an EuroHPC extreme-scale access grant. The training process was carefully structured into three key phases:

  1. Initial Pre-training (3.6 trillion tokens) This phase includes the warm-up and constant learning rate stages, during which the model is trained on a mixture of web data alongside higher quality sources such as parallel data, Wikipedia, Arxiv, books, and Apollo datasets. This balanced mix helps the model build a strong multilingual foundation.
  2. Annealing (400 billion tokens) During this phase, there is a linear decay of the learning rate and we adjust the data mix to reduce the proportion of web data while increasing the multilingual content. This shift helps the model refine its understanding across diverse languages and domains.
  3. Annealing to Zero (40 billion tokens) In this final stage, the learning rate decays linearly to zero. In this phase, the data mix was optimized to be of even higher quality, in order to polish the model's performance.

Post-training

During post-training, we adapt EuroLLM to be an instruction-following model capable of handling multi-turn conversations. We only use publicly available datasets to fine-tune the model, as we wanted to show how EuroLLM can be easily adapted for your use-cases.

The model excels at translation tasks being capable of translating across all official EU languages, outperforming strong models like Gemma-2-9B–IT and Aya-expanse-8B (instruction tuned versions of Gemma-2-9B and Aya-23-8B). Furthermore, when it comes to general benchmarks, its instruction-following capabilities are second to none when it comes to EU-made models of similar size.

Acknowledgments

We thank EuroHPC for the compute grant that allows us to train the EuroLLM models and Barcelona Super Computer (BSC) for their support. This work was partly supported by the EU’s Horizon Europe Research and Innovation Actions (UTTER, contract 101070631).

References

Duarte M. Alves, José Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo, José G.C. de Souza, André F.T. Martins. Tower: An Open Multilingual Large Language Model for Translation-Related Tasks. COLM 2024.

Pierre Colombo, Nathan Noiry, Ekhine Irurozki, Stéphan Clémençon. What are the best systems? New perspectives on NLP Benchmarking. NeurIPS 2022.

Viet Lai, Chien Nguyen, Nghia Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan Rossi, Thien Nguyen. Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback. EMNLP System Demonstrations 2023.

Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins. EuroLLM: Multilingual Language Models for Europe. 2024.