OpenVINO IR model with int8 quantization of LLaMAntino-3-ANITA-8B-Inst-DPO-ITA

Model definition for LocalAI:

name: anita-llama3
backend: transformers
parameters:
  model: fakezeta/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA-ov-int8
context_size: 8192
type: OVModelForCausalLM
template:
  use_tokenizer_template: true

To run the model directly with LocalAI:

local-ai run huggingface://fakezeta/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA-ov-int8/model.yaml
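
Once LocalAI is running, the model can be queried through its OpenAI-compatible chat endpoint. The following is a minimal sketch, not an official client. It assumes LocalAI listens on its default local address (http://localhost:8080) and that the model is exposed under the name anita-llama3 from the definition above. The helper names build_chat_request and send_chat_request are hypothetical.

```python
import json
import urllib.request

# Assumption: LocalAI is listening locally on its default port 8080.
LOCALAI_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(user_text, model="anita-llama3", url=LOCALAI_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,  # assumed to match the `name` field of the model definition
        "messages": [{"role": "user", "content": user_text}],
        "temperature": 0.6,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server up, `send_chat_request(build_chat_request("Chi è Carlo Magno?"))` should return the model's answer as a string.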

"Built with Meta Llama 3".

LLaMAntino-3-ANITA-8B-Inst-DPO-ITA is a model of the LLaMAntino - Large Language Models family. The model is an instruction-tuned version of Meta-Llama-3-8b-instruct (a fine-tuned LLaMA 3 model). This model version aims to be a Multilingual Model 🏁 (EN 🇺🇸 + ITA🇮🇹) that can be further fine-tuned on specific tasks in Italian.

The 🌟ANITA project🌟 *(Advanced Natural-based interaction for the ITAlian language)* aims to provide Italian NLP researchers with an improved model for Italian-language 🇮🇹 use cases.


Live DEMO: https://chat.llamantino.it/
It works only from an Italian internet connection.


Model Details

Last Update: 10/05/2024

https://github.com/marcopoli/LLaMAntino-3-ANITA

Model HF GGUF EXL2
swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA Link Link Link

Specifications

  • Model developers:
    Ph.D. Marco Polignano - University of Bari Aldo Moro, Italy
    SWAP Research Group
  • Variations: The released model was trained with supervised fine-tuning (SFT) using 4-bit QLoRA on instruction-based datasets. DPO over the mlabonne/orpo-dpo-mix-40k dataset was then used to align it with human preferences for helpfulness and safety.
  • Input: The model accepts text input only.
  • Language: Multilingual 🏁 + Italian 🇮🇹
  • Output: The model generates text and code only.
  • Model Architecture: Llama 3 architecture.
  • Context length: 8K (8192 tokens).
  • Library Used: Unsloth

Playground

There are several ways to use the model directly. Choose one of the following to try it out.

Prompt Template

<|start_header_id|>system<|end_header_id|>

{ SYS Prompt }<|eot_id|><|start_header_id|>user<|end_header_id|>

{ USER Prompt }<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{ ASSIST Prompt }<|eot_id|>
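
To make the template concrete, the sketch below assembles a prompt string from a list of chat messages, following the layout shown above literally. The function name format_prompt is hypothetical, and this is only an illustration: in practice tokenizer.apply_chat_template (used in the Transformers section) is the safer choice, since the tokenizer's own template may add tokens not shown here.

```python
def format_prompt(messages, add_generation_prompt=True):
    """Render chat messages using the header/eot layout of the prompt template.

    `messages` is a list of {"role": ..., "content": ...} dicts.
    """
    parts = []
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn marker.
        header = f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        parts.append(header + message["content"] + "<|eot_id|>")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Rispondi in modo chiaro."},
    {"role": "user", "content": "Chi è Carlo Magno?"},
]
print(format_prompt(messages))
```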

Transformers

For direct use with transformers, you can easily get started with the following steps.

  • First, install transformers and the related libraries with pip.

    pip install -U transformers trl peft accelerate bitsandbytes
    
  • You can then start using the model directly.

    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
    )
    
    base_model = "swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA"
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    
    sys = "Sei un an assistente AI per la lingua Italiana di nome LLaMAntino-3 ANITA " \
        "(Advanced Natural-based interaction for the ITAlian language)." \
        " Rispondi nella lingua usata per la domanda in modo chiaro, semplice ed esaustivo."
    
    messages = [
        {"role": "system", "content": sys},
        {"role": "user", "content": "Chi è Carlo Magno?"}
    ]
    
    # Method 1: build the prompt with the chat template and call generate()
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    # Move the tensors to the device the model was placed on by device_map="auto"
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, top_p=0.9, temperature=0.6)
    results = tokenizer.batch_decode(outputs)[0]
    print(results)
    
    # Method 2: use a text-generation pipeline
    import transformers
    pipe = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        return_full_text=False,  # return only the newly generated text, not the prompt
        task='text-generation',
        max_new_tokens=512,  # maximum number of tokens to generate in the output
        temperature=0.6,  # lower values give more deterministic answers, higher values more creative ones
        do_sample=True,
        top_p=0.9,
    )
    
    sequences = pipe(messages)
    for seq in sequences:
        print(f"{seq['generated_text']}")
    
  • Additionally, you can load the model with 4-bit quantization to reduce the resources required. You can start with the code below.

    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )
    
    base_model = "swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA"
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
    )
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=bnb_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    
    sys = "Sei un an assistente AI per la lingua Italiana di nome LLaMAntino-3 ANITA " \
        "(Advanced Natural-based interaction for the ITAlian language)." \
        " Rispondi nella lingua usata per la domanda in modo chiaro, semplice ed esaustivo."
    
    messages = [
        {"role": "system", "content": sys},
        {"role": "user", "content": "Chi è Carlo Magno?"}
    ]
    
    # Method 1: build the prompt with the chat template and call generate()
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    # Move the tensors to the device the model was placed on by device_map="auto"
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, top_p=0.9, temperature=0.6)
    results = tokenizer.batch_decode(outputs)[0]
    print(results)
    
    # Method 2: use a text-generation pipeline
    import transformers
    pipe = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        return_full_text=False,  # return only the newly generated text, not the prompt
        task='text-generation',
        max_new_tokens=512,  # maximum number of tokens to generate in the output
        temperature=0.6,  # lower values give more deterministic answers, higher values more creative ones
        do_sample=True,
        top_p=0.9,
    )
    
    sequences = pipe(messages)
    for seq in sequences:
        print(f"{seq['generated_text']}")
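
For a rough sense of what 4-bit loading saves, the sketch below estimates the memory taken by the weights alone at different precisions. It is back-of-the-envelope arithmetic under stated assumptions: the parameter count is a nominal 8 billion (the "8B" in the model name taken at face value), and activations, the KV cache, and quantization metadata are ignored, so real usage will be higher. The function name weight_memory_gib is hypothetical.

```python
# Assumption: "8B" in the model name taken as exactly 8e9 parameters.
NOMINAL_PARAMS = 8e9

# Bits used to store one weight at each precision.
BITS_PER_WEIGHT = {"bfloat16": 16, "int8": 8, "nf4 (4-bit)": 4}


def weight_memory_gib(num_params, bits_per_weight):
    """Memory needed for the weights only, in GiB (2**30 bytes)."""
    return num_params * bits_per_weight / 8 / 2**30


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{weight_memory_gib(NOMINAL_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights need roughly 14.9 GiB in bfloat16 and roughly 3.7 GiB in 4-bit, a fourfold reduction.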
    

Evaluation

Open LLM Leaderboard:

Evaluated with lm-evaluation-benchmark-harness for the Open Italian LLMs Leaderboard

   lm_eval --model hf --model_args pretrained=HUGGINGFACE_MODEL_ID  --tasks hellaswag_it,arc_it  --device cuda:0 --batch_size auto:2
   lm_eval --model hf --model_args pretrained=HUGGINGFACE_MODEL_ID  --tasks m_mmlu_it --num_fewshot 5  --device cuda:0 --batch_size auto:2 

Metric         Value
Avg.           0.6160
Arc_IT         0.5714
Hellaswag_IT   0.7093
MMLU_IT        0.5672
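
The reported average is consistent with the unweighted mean of the three task scores, as this short check shows. That the leaderboard uses a plain arithmetic mean is an assumption here, inferred from the numbers matching.

```python
# Task scores copied from the table above.
scores = {"Arc_IT": 0.5714, "Hellaswag_IT": 0.7093, "MMLU_IT": 0.5672}

# Unweighted arithmetic mean over the three tasks.
average = sum(scores.values()) / len(scores)
print(f"{average:.4f}")  # prints 0.6160
```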

Unsloth

Unsloth is a great tool that helps us develop products easily and at a lower cost than expected.

Citation instructions

@misc{polignano2024advanced,
      title={Advanced Natural-based interaction for the ITAlian language: LLaMAntino-3-ANITA}, 
      author={Marco Polignano and Pierpaolo Basile and Giovanni Semeraro},
      year={2024},
      eprint={2405.07101},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{basile2023llamantino,
      title={LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language}, 
      author={Pierpaolo Basile and Elio Musacchio and Marco Polignano and Lucia Siciliani and Giuseppe Fiameni and Giovanni Semeraro},
      year={2023},
      eprint={2312.09993},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@article{llama3modelcard,
  title={Llama 3 Model Card},
  author={AI@Meta},
  year={2024},
  url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}

Acknowledgments

We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU. Models are built on the Leonardo supercomputer with the support of CINECA-Italian Super Computing Resource Allocation, class C project IscrC_Pro_MRS (HP10CQO70G).

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric                              Value
Avg.                                75.12
AI2 Reasoning Challenge (25-Shot)   74.57
HellaSwag (10-Shot)                 92.75
MMLU (5-Shot)                       66.85
TruthfulQA (0-shot)                 75.93
Winogrande (5-shot)                 82.00
GSM8k (5-shot)                      58.61