Quantization made by Richard Erkhov.

Llama-3-8B-Instruct-80K-QLoRA-Merged - GGUF

Model creator: https://huggingface.co/namespace-Pt/
Original model: https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged/

Name	Quant method	Size
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q2_K.gguf	Q2_K	2.96GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.IQ3_XS.gguf	IQ3_XS	3.28GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.IQ3_S.gguf	IQ3_S	3.43GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q3_K_S.gguf	Q3_K_S	3.41GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.IQ3_M.gguf	IQ3_M	3.52GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q3_K.gguf	Q3_K	3.74GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q3_K_M.gguf	Q3_K_M	3.74GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q3_K_L.gguf	Q3_K_L	4.03GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.IQ4_XS.gguf	IQ4_XS	4.18GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q4_0.gguf	Q4_0	4.34GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.IQ4_NL.gguf	IQ4_NL	4.38GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q4_K_S.gguf	Q4_K_S	4.37GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q4_K.gguf	Q4_K	4.58GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q4_K_M.gguf	Q4_K_M	4.58GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q4_1.gguf	Q4_1	4.78GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q5_0.gguf	Q5_0	5.21GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q5_K_S.gguf	Q5_K_S	5.21GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q5_K.gguf	Q5_K	5.34GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q5_K_M.gguf	Q5_K_M	5.34GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q5_1.gguf	Q5_1	5.65GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q6_K.gguf	Q6_K	6.14GB
Llama-3-8B-Instruct-80K-QLoRA-Merged.Q8_0.gguf	Q8_0	7.95GB
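
As a rough aid for choosing among the files above, the following sketch picks the largest quantization whose file fits a given size budget. The sizes are copied from the table; the helper name and the budget values are hypothetical. File size is only a lower bound on memory use, because the KV cache for long contexts needs additional room that this sketch does not model.

```python
# File sizes in GB, copied from the table above.
QUANT_SIZES_GB = {
    "Q2_K": 2.96, "IQ3_XS": 3.28, "IQ3_S": 3.43, "Q3_K_S": 3.41,
    "IQ3_M": 3.52, "Q3_K": 3.74, "Q3_K_M": 3.74, "Q3_K_L": 4.03,
    "IQ4_XS": 4.18, "Q4_0": 4.34, "IQ4_NL": 4.38, "Q4_K_S": 4.37,
    "Q4_K": 4.58, "Q4_K_M": 4.58, "Q4_1": 4.78, "Q5_0": 5.21,
    "Q5_K_S": 5.21, "Q5_K": 5.34, "Q5_K_M": 5.34, "Q5_1": 5.65,
    "Q6_K": 6.14, "Q8_0": 7.95,
}

MODEL_NAME = "Llama-3-8B-Instruct-80K-QLoRA-Merged"


def largest_quant_within(budget_gb):
    """Return (filename, size_gb) of the largest file not exceeding budget_gb.

    Returns None when no file fits. Ties in size are broken by name order.
    """
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items()
               if size <= budget_gb]
    if not fitting:
        return None
    size, name = max(fitting)
    return f"{MODEL_NAME}.{name}.gguf", size


if __name__ == "__main__":
    # Hypothetical budget: 5 GB available for the model file itself.
    print(largest_quant_within(5.0))
```

Leaving headroom below the real memory limit is advisable, since the budget here covers the weights only.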

Original model description:

license: mit
pipeline_tag: text-generation

Llama-3-8B-Instruct-80K-QLoRA-Merged

[Data&Code]

We extend the context length of Llama-3-8B-Instruct to 80K using QLoRA and 3.5K long-context training samples synthesized with GPT-4. Training is efficient: the entire cycle takes 8 hours on one 8xA800 (80G) machine. The resulting model nevertheless performs strongly on a series of downstream long-context evaluation benchmarks.

NOTE: This model is the result of merging meta-llama/Meta-Llama-3-8B-Instruct and namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA.

Evaluation

All of the following evaluation results can be reproduced by following the instructions here.

Needle in a Haystack

We evaluate the model on the Needle-In-A-HayStack task using the official setting. The blue vertical line indicates the training context length, i.e. 80K.
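
To illustrate the general idea behind this kind of test, the sketch below places a "needle" sentence at a chosen relative depth inside filler text and appends a question. It is not the official setting: the function name, the sentence-level depth, and the sample texts are all assumptions made for illustration, whereas the official benchmark uses its own haystack, needle, and token-based depths.

```python
def build_haystack_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at fractional `depth` of the filler and append `question`.

    depth = 0.0 puts the needle first, depth = 1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    # Sentence index at which the needle is inserted.
    position = int(depth * len(filler_sentences))
    sentences = (filler_sentences[:position]
                 + [needle]
                 + filler_sentences[position:])
    context = " ".join(sentences)
    return f"{context}\n\n{question}"


if __name__ == "__main__":
    # Hypothetical filler and needle, for demonstration only.
    filler = ["The sky was clear.", "Traffic was light.",
              "The market opened early.", "Rain was expected later."]
    prompt = build_haystack_prompt(
        filler,
        needle="The secret code is 4172.",
        depth=0.5,
        question="What is the secret code?",
    )
    print(prompt)
```

Sweeping `depth` and the amount of filler gives the two axes that such evaluations typically plot.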

LongBench

We evaluate the model on LongBench using 32K context length and the official prompt template. For meta-llama/Meta-Llama-3-8B-Instruct, we use 8K context length.

Model	Single-Doc QA	Multi-Doc QA	Summarization	Few-Shot Learning	Synthetic	Code	Avg
meta-llama/Meta-Llama-3-8B-Instruct	37.33	36.04	26.83	69.56	37.75	53.24	43.20
gradientai/Llama-3-8B-Instruct-262k	37.29	31.20	26.18	67.25	44.25	62.71	43.73
Llama-3-8B-Instruct-80K-QLoRA-Merged	43.57	43.07	28.93	69.15	48.50	51.95	47.19

InfiniteBench

We evaluate the model on InfiniteBench using 80K context length and the official prompt template. The results of GPT-4 are copied from the paper. For meta-llama/Meta-Llama-3-8B-Instruct, we use 8K context length.

Model	LongBookQA Eng	LongBookSum Eng
GPT-4	22.22	14.73
meta-llama/Meta-Llama-3-8B-Instruct	7.00	16.40
gradientai/Llama-3-8B-Instruct-262k	20.30	10.34
Llama-3-8B-Instruct-80K-QLoRA-Merged	30.92	14.73

Topic Retrieval

We evaluate the model on the Topic Retrieval task with [5,10,15,20,25,30,40,50,60,70] topics.

MMLU

We evaluate the model's zero-shot performance on the MMLU benchmark as a reflection of its short-context capability.

Model	STEM	Social Sciences	Humanities	Others	Avg
Llama-2-7B-Chat	35.92	54.37	51.74	51.42	47.22
Mistral-7B-v0.2-Instruct	48.79	69.95	64.99	61.64	60.10
meta-llama/Meta-Llama-3-8B-Instruct	53.87	75.66	69.44	69.75	65.91
gradientai/Llama-3-8B-Instruct-262k	52.10	73.26	67.15	69.80	64.34
Llama-3-8B-Instruct-80K-QLoRA-Merged	53.10	73.24	67.32	68.79	64.44

Environment

torch==2.2.2
flash_attn==2.5.6
transformers==4.39.3
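
A small check such as the following can confirm that the installed packages match the pinned versions above. This is a sketch under assumptions: the helper name is hypothetical, the distribution names are assumed to match the names listed above, and local build suffixes such as `+cu121` are ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above.
PINNED = {"torch": "2.2.2", "flash_attn": "2.5.6", "transformers": "4.39.3"}


def find_mismatches(pinned=PINNED, get_version=version):
    """Return {package: (wanted, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, wanted in pinned.items():
        try:
            installed = get_version(package)
        except PackageNotFoundError:
            installed = None
        # Drop a local build suffix such as "+cu121" before comparing.
        public = installed.split("+")[0] if installed else None
        if public != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches().items():
        print(f"{name}: expected {wanted}, found {installed}")
```

Other versions may also work; the pins only record the environment the authors report.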

Usage

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged"

torch_dtype = torch.bfloat16
# place the model on GPU
device_map = {"": "cuda"}

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch_dtype,
  device_map=device_map,
  attn_implementation="flash_attention_2",
).eval()

with torch.no_grad():
  # short context
  messages = [{"role": "user", "content": "Tell me about yourself."}]
  inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")
  outputs = model.generate(**inputs, max_new_tokens=50)[:, inputs["input_ids"].shape[1]:]
  print(f"Input Length: {inputs['input_ids'].shape[1]}")
  print(f"Output:       {tokenizer.decode(outputs[0])}")

  # long context
  with open("data/narrativeqa.json", encoding="utf-8") as f:
    example = json.load(f)
  messages = [{"role": "user", "content": example["context"]}]
  inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")
  outputs = model.generate(**inputs, do_sample=False, top_p=1, temperature=1, max_new_tokens=20)[:, inputs["input_ids"].shape[1]:]
  print("*"*20)
  print(f"Input Length: {inputs['input_ids'].shape[1]}")
  print(f"Answers:      {example['answer']}")
  print(f"Prediction:   {tokenizer.decode(outputs[0])}")

You may see messages such as "This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (8192). Depending on the model, you may observe exceptions, performance degradation, or nothing at all." or "Setting pad_token_id to eos_token_id:128001 for open-end generation." Both are harmless and can be ignored.
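
If the pad_token_id message mentioned above is distracting, passing `pad_token_id` explicitly to `generate` avoids it. The helper below is a hypothetical convenience wrapper, not part of the original usage code; it only assumes that the tokenizer exposes `eos_token_id`, as the tokenizer loaded above does. It does not affect the maximum-length reminder.

```python
def quiet_generate_kwargs(tokenizer, max_new_tokens):
    """Build keyword arguments for model.generate with an explicit pad token.

    Supplying pad_token_id up front means generate() does not need to fall
    back to eos_token_id and log a message about doing so.
    """
    return {
        "max_new_tokens": max_new_tokens,
        "pad_token_id": tokenizer.eos_token_id,
    }


# Usage with the objects from the example above (not executed here):
#   outputs = model.generate(**inputs, **quiet_generate_kwargs(tokenizer, 50))
```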