PULI-LlumiX-Llama-3.1 8B base (8.03 billion parameters)
- Trained with LLaMA-Factory (GitHub)
- The Llama 3.1 8B Instruct model was continually pretrained on a Hungarian dataset (an illustrative training sketch follows below)
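The actual training run used LLaMA-Factory; the sketch below only illustrates, in plain Hugging Face Transformers, what continual pretraining of Llama 3.1 8B Instruct on raw Hungarian text looks like. The base checkpoint name is the public Meta release, while the corpus file hu_corpus.txt, the output directory, and all hyperparameters are placeholder assumptions, not the published recipe.

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Llama-3.1-8B-Instruct"  # assumed starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# "hu_corpus.txt" is a placeholder for the Hungarian pretraining corpus.
raw = load_dataset("text", data_files={"train": "hu_corpus.txt"})

def tokenize(batch):
    # 16 384 matches the max_seq_length reported under Limitations.
    return tokenizer(batch["text"], truncation=True, max_length=16384)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="puli-llumix-cpt",  # placeholder output directory
        bf16=True,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
    ),
    train_dataset=tokenized["train"],
    # mlm=False gives standard causal-language-modelling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()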
Dataset for continued pretraining
- Hungarian (8.08 billion words): 763K documents longer than 5000 words, plus the Hungarian Wikipedia (the length filter is sketched after this list)
- English: Long Context QA (2 billion words), BookSum (78 million words)
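As a rough illustration of the length filter above, the snippet below keeps only documents longer than 5000 whitespace-separated words; long_documents and the in-memory corpus are hypothetical, not the authors' preprocessing code.

def long_documents(documents, min_words=5000):
    # Yield only texts whose word count exceeds min_words.
    for doc in documents:
        if len(doc.split()) > min_words:
            yield doc

# Example: filter an in-memory list of raw texts.
corpus = ["rövid szöveg", "hosszú dokumentum ..."]  # placeholder texts
kept = list(long_documents(corpus))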
Limitations
- max_seq_length = 16 384 tokens (a context-length check is sketched after this list)
- bfloat16 precision
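Because of the 16 384-token limit, very long prompts should be checked (or truncated) before generation. The helper below is a minimal sketch using the standard Transformers tokenizer API; fits_context is a hypothetical name, not part of the released model.

from transformers import AutoTokenizer

MAX_SEQ_LENGTH = 16384
tokenizer = AutoTokenizer.from_pretrained("NYTK/PULI-LlumiX-Llama-3.1")

def fits_context(prompt: str, max_new_tokens: int = 30) -> bool:
    # Count the prompt tokens and leave room for the tokens to be generated.
    n_prompt_tokens = len(tokenizer(prompt)["input_ids"])
    return n_prompt_tokens + max_new_tokens <= MAX_SEQ_LENGTH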
Usage with pipeline
from transformers import pipeline, LlamaForCausalLM, AutoTokenizer
model = LlamaForCausalLM.from_pretrained("NYTK/PULI-LlumiX-Llama-3.1")
tokenizer = AutoTokenizer.from_pretrained("NYTK/PULI-LlumiX-Llama-3.1")
prompt = "Elmes茅lek egy t枚rt茅netet a nyelvtechnol贸gi谩r贸l."
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device=0)
print(generator(prompt, max_new_tokens=30)[0]["generated_text"])
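The example above loads the weights in the default precision. The variant below is a sketch that loads them in bfloat16 (the precision listed under Limitations) and lets Accelerate place them on the available GPU(s); device_map="auto" assumes the accelerate package is installed.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained(
    "NYTK/PULI-LlumiX-Llama-3.1",
    torch_dtype=torch.bfloat16,  # matches the precision listed above
    device_map="auto",           # requires accelerate
)
tokenizer = AutoTokenizer.from_pretrained("NYTK/PULI-LlumiX-Llama-3.1")
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
print(generator(prompt, max_new_tokens=30)[0]["generated_text"])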