Llama-Krikri-8B-Base: A large foundation language model for the Greek language

Following the release of Meltemi-7B on the 26th of March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs. Krikri is built on top of Llama-3.1-8B, extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Base, as well as an instruct version, Llama-Krikri-8B-Instruct.


Model Information

  • Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
  • 128k context length (approximately 80,000 Greek words)
  • We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language by utilizing a large training corpus.
    • This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
    • Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
    • The training corpus also contains 7.8 billion math and code tokens.
    • This corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below:
| Sub-corpus | # Tokens | Percentage |
|------------|----------|------------|
| Greek      | 56.7 B   | 62.3%      |
| English    | 21.0 B   | 23.1%      |
| Parallel   | 5.5 B    | 6.0%       |
| Math/Code  | 7.8 B    | 8.6%       |
| Total      | 91 B     | 100%       |

Selected subsets of the 91-billion-token corpus were upsampled, resulting in a final training size of 110 billion tokens.
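To illustrate the vocabulary extension described above, the following sketch loads the Krikri tokenizer and counts the tokens produced for a short Greek sentence. The sentence is illustrative only, and the optional comparison with the original Llama-3.1 tokenizer assumes access to the gated meta-llama/Llama-3.1-8B repository.

from transformers import AutoTokenizer

# Tokenizer with the extended Greek vocabulary
krikri_tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")

text = "Το κρικρί είναι ένα είδος άγριας αίγας που ζει στην Κρήτη."
print(len(krikri_tokenizer(text)["input_ids"]), "tokens with the extended vocabulary")

# Optional comparison with the original Llama-3.1 tokenizer (gated repository):
# base_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
# print(len(base_tokenizer(text)["input_ids"]), "tokens with the original vocabulary")

A lower token count for Greek text means that more text fits into the 128k context window and that generation requires fewer steps.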

How to use

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the base model and its tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Base")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")

model.to(device)

# Tokenize a Greek prompt and sample a continuation
input_text = tokenizer("Ένα κρικρί διαφέρει από ένα λάμα επειδή", return_tensors='pt').to(device)
outputs = model.generate(input_text['input_ids'], max_new_tokens=256, do_sample=True)

print(tokenizer.batch_decode(outputs)[0])
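The snippet above loads the weights in their default precision. Since the checkpoint is distributed in bfloat16 (the vLLM example below also serves it with --dtype 'bfloat16'), a more memory-friendly variant is sketched below; device_map="auto" assumes the accelerate package is installed.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights in bfloat16 and place them automatically on the available devices
model = AutoModelForCausalLM.from_pretrained(
    "ilsp/Llama-Krikri-8B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")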

With OpenAI compatible server via vLLM

vllm serve ilsp/Llama-Krikri-8B-Base \
  --enforce-eager \
  --dtype 'bfloat16' \
  --api-key token-abc123

The model can then be queried from Python using the OpenAI client:

from openai import OpenAI

api_key = "token-abc123"
base_url = "http://localhost:8000/v1"

client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

response = client.completions.create(
    model="ilsp/Llama-Krikri-8B-Base",
    prompt="Η εκπαίδευση μεγάλων γλωσσικών μοντέλων περιλαμβάνει",
)
print(response.choices[0].text)
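Note that Llama-Krikri-8B-Base is a base (completion) model, so it is queried through the completions endpoint rather than chat.completions. Sampling parameters can be passed in the same call; the values below are illustrative, not recommendations from the model authors.

response = client.completions.create(
    model="ilsp/Llama-Krikri-8B-Base",
    prompt="Η εκπαίδευση μεγάλων γλωσσικών μοντέλων περιλαμβάνει",
    max_tokens=256,    # length cap for the generated continuation
    temperature=0.8,   # sampling temperature; lower is more deterministic
    top_p=0.95,        # nucleus sampling threshold
)
print(response.choices[0].text)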

Evaluation

Below, we report improvements of Llama-Krikri-8B-Base over Llama-3.1-8B for Greek and English:

  • +10.8% on Greek benchmarks
  • +0.8% on English benchmarks

Our evaluations for Llama-Krikri-8B-Base, Llama-3.1-8B, and Meltemi 7B v1.5 are performed in a few-shot setting, consistent with the settings in the Open LLM leaderboard.

Greek Benchmarks

The evaluation suite we created for the Greek language includes 6 test sets. You can run the suite by cloning this lighteval fork.

Our evaluation suite includes:

  • Medical MCQA EL (15-shot)
  • Belebele EL (5-shot)
  • HellaSwag EL (10-shot)
  • ARC-Challenge EL (25-shot)
  • TruthfulQA MC2 EL (0-shot)
  • MMLU EL (5-shot)

We can see that our training enhances performance across all Greek test sets, with an average improvement of +10.8%. The results for the Greek test sets are shown in the following table:

|  | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
|---|---|---|---|---|---|---|---|
| Meltemi 7B v1.5 | 42.2% | 61.0% | 53.8% | 40.0% | 49.0% | 41.2% | 47.9% |
| Llama-3.1-8B | 33.4% | 72.8% | 52.1% | 39.9% | 51.1% | 42.6% | 48.7% |
| Llama-Krikri-8B | 53.8% | 82.7% | 64.6% | 49.4% | 54.2% | 52.0% | 59.5% |
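The Average column appears to be the unweighted mean of the six benchmark scores; the short check below, with values copied from the table above, reproduces the reported averages and the +10.8% improvement up to rounding.

# Greek benchmark scores copied from the table above
krikri_el = [53.8, 82.7, 64.6, 49.4, 54.2, 52.0]
llama31_el = [33.4, 72.8, 52.1, 39.9, 51.1, 42.6]

avg_krikri = sum(krikri_el) / len(krikri_el)     # ~59.5
avg_llama31 = sum(llama31_el) / len(llama31_el)  # ~48.7
print(avg_krikri, avg_llama31, avg_krikri - avg_llama31)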

English Benchmarks

We can also see that our training methodology not only mitigates catastrophic forgetting effectively, but also improves average performance across the English test sets by +0.8%. The results for the English test sets are shown in the following table:

|  | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average |
|---|---|---|---|---|---|---|---|
| Meltemi 7B v1.5 | 73.4% | 77.7% | 79.6% | 54.1% | 40.5% | 56.9% | 63.7% |
| Llama-3.1-8B | 74.6% | 71.5% | 82.0% | 58.5% | 44.2% | 66.2% | 66.2% |
| Llama-Krikri-8B | 72.6% | 79.8% | 80.7% | 57.8% | 44.8% | 65.1% | 67.0% |

Please note that all evaluations were run with the latest version of lighteval, which differs in some respects from past versions. This is why we report different scores for Meltemi-7B-v1.5.

Ethical Considerations

This model has not been aligned with human preferences, and therefore might generate misleading, harmful, and toxic content.

Acknowledgements

The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the OCRE Cloud framework, providing Amazon Web Services for the Greek Academic and Research Community.
