metadata

license: mit
model-index:
  - name: piccolo-math-2x7b
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 69.11
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/piccolo-math-2x7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 87.27
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/piccolo-math-2x7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 63.69
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/piccolo-math-2x7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 63.86
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/piccolo-math-2x7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 79.87
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/piccolo-math-2x7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 70.13
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/piccolo-math-2x7b
          name: Open LLM Leaderboard

Piccolo-math-2x7b

In loving memory of my dog Klaus (Piccolo)

~ Piccolo (Italian): the little one ~

$piccolo.png$

Code Example

Inference and Evaluation colab available here

from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_response(prompt):
    """
    Generate a response from the model based on the input prompt.
    Args:
    prompt (str): Prompt for the model.

    Returns:
    str: The generated response from the model.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id)

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response

model_id = "macadeliccc/piccolo-math-2x7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,load_in_4bit=True)

prompt = "What is the best way to train Cane Corsos?"

print("Response:")
print(generate_response(prompt), "\n")

The model is capable of quality code, math, and logical reasoning. Try whatever questions you think of.

Evaluations

Model	AGIEval	GPT4All	TruthfulQA	Bigbench	Average
piccolo-math-2x7b	43.89	74.98	63.96	44.99	56.96

EQ Bench

Benchmark Complete:

2024-01-24 00:00:40
Time taken: 183.3 mins
Prompt Format: Mistral
Model: macadeliccc/piccolo-math-2x7b
Score (v2): 70.74
Parseable: 167.0

Batch completed Time taken: 183.3 mins

AGIEval

Task	Version	Metric	Value		Stderr
agieval_aqua_rat	0	acc	24.41	±	2.70
		acc_norm	24.80	±	2.72
agieval_logiqa_en	0	acc	35.79	±	1.88
		acc_norm	36.71	±	1.89
agieval_lsat_ar	0	acc	23.48	±	2.80
		acc_norm	23.91	±	2.82
agieval_lsat_lr	0	acc	49.22	±	2.22
		acc_norm	50.00	±	2.22
agieval_lsat_rc	0	acc	63.94	±	2.93
		acc_norm	64.31	±	2.93
agieval_sat_en	0	acc	77.18	±	2.93
		acc_norm	76.70	±	2.95
agieval_sat_en_without_passage	0	acc	45.15	±	3.48
		acc_norm	44.66	±	3.47
agieval_sat_math	0	acc	33.64	±	3.19
		acc_norm	30.00	±	3.10

Average: 43.89%

GPT4All

Task	Version	Metric	Value		Stderr
arc_challenge	0	acc	61.86	±	1.42
		acc_norm	62.88	±	1.41
arc_easy	0	acc	84.34	±	0.75
		acc_norm	80.47	±	0.81
boolq	1	acc	86.88	±	0.59
hellaswag	0	acc	68.56	±	0.46
		acc_norm	85.16	±	0.35
openbookqa	0	acc	37.00	±	2.16
		acc_norm	47.80	±	2.24
piqa	0	acc	82.21	±	0.89
		acc_norm	83.68	±	0.86
winogrande	0	acc	77.98	±	1.16

Average: 74.98%

TruthfulQA

Task	Version	Metric	Value		Stderr
truthfulqa_mc	1	mc1	47.37	±	1.75
		mc2	63.96	±	1.57

Average: 63.96%

Bigbench

Task	Version	Metric	Value		Stderr
bigbench_causal_judgement	0	multiple_choice_grade	55.26	±	3.62
bigbench_date_understanding	0	multiple_choice_grade	63.14	±	2.51
bigbench_disambiguation_qa	0	multiple_choice_grade	42.64	±	3.08
bigbench_geometric_shapes	0	multiple_choice_grade	22.84	±	2.22
		exact_str_match	3.34	±	0.95
bigbench_logical_deduction_five_objects	0	multiple_choice_grade	36.60	±	2.16
bigbench_logical_deduction_seven_objects	0	multiple_choice_grade	25.57	±	1.65
bigbench_logical_deduction_three_objects	0	multiple_choice_grade	56.00	±	2.87
bigbench_movie_recommendation	0	multiple_choice_grade	42.40	±	2.21
bigbench_navigate	0	multiple_choice_grade	54.70	±	1.57
bigbench_reasoning_about_colored_objects	0	multiple_choice_grade	62.90	±	1.08
bigbench_ruin_names	0	multiple_choice_grade	53.35	±	2.36
bigbench_salient_translation_error_detection	0	multiple_choice_grade	24.35	±	1.36
bigbench_snarks	0	multiple_choice_grade	62.43	±	3.61
bigbench_sports_understanding	0	multiple_choice_grade	70.28	±	1.46
bigbench_temporal_sequences	0	multiple_choice_grade	41.30	±	1.56
bigbench_tracking_shuffled_objects_five_objects	0	multiple_choice_grade	22.32	±	1.18
bigbench_tracking_shuffled_objects_seven_objects	0	multiple_choice_grade	17.77	±	0.91
bigbench_tracking_shuffled_objects_three_objects	0	multiple_choice_grade	56.00	±	2.87

Average: 44.99%

Average score: 56.96%

Elapsed time: 01:51:53

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	72.32
AI2 Reasoning Challenge (25-Shot)	69.11
HellaSwag (10-Shot)	87.27
MMLU (5-Shot)	63.69
TruthfulQA (0-shot)	63.86
Winogrande (5-shot)	79.87
GSM8k (5-shot)	70.13