Phi 3 Mini 4K Instruct GGUF
Updated with Microsoft’s latest model changes as of July 21, 2024
Original model: Phi-3-mini-4k-instruct
Model creator: Microsoft
This repo contains GGUF format model files for Microsoft’s Phi 3 Mini 4K Instruct.
Phi-3-Mini-4K-Instruct is a 3.8B-parameter, lightweight, state-of-the-art open model trained on the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality and reasoning-dense properties.
Learn more on Microsoft’s Model page.
What is GGUF?
GGUF is a file format for representing AI models, introduced by the llama.cpp team on August 21, 2023 as the third version of the format. It replaces GGML, which is no longer supported by llama.cpp. These files were converted with llama.cpp build 3432 (revision 45f2c19), using autogguf.
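As a quick illustration of the format (not part of the original card): every GGUF file opens with a small fixed header, the ASCII magic `GGUF`, a format version, and tensor/metadata counts. Below is a minimal sketch of header parsing, assuming the GGUF v3 layout; `parse_gguf_header` is a hypothetical helper, not a llama.cpp API:

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata key/value count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic 24-byte header standing in for the start of a real model file
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(parse_gguf_header(header))  # {'version': 3, 'tensors': 291, 'metadata_kv': 24}
```

In a real file the metadata key/value section (architecture, context length, tokenizer, quantization type) follows immediately after this header.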
Prompt template
<|system|>
{{system_prompt}}<|end|>
<|user|>
{{prompt}}<|end|>
<|assistant|>
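For illustration, the template above can be rendered with plain string formatting before handing the text to llama.cpp; `build_prompt` is a hypothetical helper, not part of any library:

```python
def build_prompt(system_prompt: str, prompt: str) -> str:
    """Render the Phi-3 chat template shown above."""
    return (
        f"<|system|>\n{system_prompt}<|end|>\n"
        f"<|user|>\n{prompt}<|end|>\n"
        f"<|assistant|>\n"
    )

text = build_prompt("You are a helpful assistant.", "What is GGUF?")
print(text)
```

The trailing `<|assistant|>` token is what cues the model to generate its reply; generation should stop at the next `<|end|>`.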
Download & run with cnvrs on iPhone, iPad, and Mac!
cnvrs is the best app for private, local AI on your device:
- create & save Characters with custom system prompts & temperature settings
- download and experiment with any GGUF model you can find on HuggingFace!
- make it your own with custom Theme colors
- powered by Metal ⚡️ & Llama.cpp, with haptics during response streaming!
- try it out yourself today on TestFlight!
- follow cnvrs on Twitter to stay up to date
Original Model Evaluation
Comparison of July update vs original April release:
| Benchmarks | Original | June 2024 Update |
|---|---|---|
| Instruction Extra Hard | 5.7 | 6.0 |
| Instruction Hard | 4.9 | 5.1 |
| Instructions Challenge | 24.6 | 42.3 |
| JSON Structure Output | 11.5 | 52.3 |
| XML Structure Output | 14.4 | 49.8 |
| GPQA | 23.7 | 30.6 |
| MMLU | 68.8 | 70.9 |
| Average | 21.9 | 36.7 |
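The Average row in the comparison above is simply the arithmetic mean of the seven benchmark scores, which can be checked directly (values copied from the table, rounded to one decimal):

```python
# Benchmark scores from the comparison table, in row order
original = [5.7, 4.9, 24.6, 11.5, 14.4, 23.7, 68.8]
updated = [6.0, 5.1, 42.3, 52.3, 49.8, 30.6, 70.9]

def avg(xs):
    return round(sum(xs) / len(xs), 1)

print(avg(original), avg(updated))  # 21.9 36.7
```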
Original April release
As is now standard, we use few-shot prompts to evaluate the models, at temperature 0. The prompts and number of shots are part of a Microsoft internal tool to evaluate language models, and in particular we did no optimization to the pipeline for Phi-3. More specifically, we do not change prompts, pick different few-shot examples, change prompt format, or do any other form of optimization for the model.
The number of k-shot examples is listed per benchmark.
| Benchmark | Phi-3-Mini-4K-In 3.8b | Phi-2 2.7b | Mistral 7b | Gemma 7b | Llama-3-In 8b | Mixtral 8x7b | GPT-3.5 version 1106 |
|---|---|---|---|---|---|---|---|
| MMLU 5-Shot | 68.8 | 56.3 | 61.7 | 63.6 | 66.5 | 68.4 | 71.4 |
| HellaSwag 5-Shot | 76.7 | 53.6 | 58.5 | 49.8 | 71.1 | 70.4 | 78.8 |
| ANLI 7-Shot | 52.8 | 42.5 | 47.1 | 48.7 | 57.3 | 55.2 | 58.1 |
| GSM-8K 0-Shot; CoT | 82.5 | 61.1 | 46.4 | 59.8 | 77.4 | 64.7 | 78.1 |
| MedQA 2-Shot | 53.8 | 40.9 | 49.6 | 50.0 | 60.5 | 62.2 | 63.4 |
| AGIEval 0-Shot | 37.5 | 29.8 | 35.1 | 42.1 | 42.0 | 45.2 | 48.4 |
| TriviaQA 5-Shot | 64.0 | 45.2 | 72.3 | 75.2 | 67.7 | 82.2 | 85.8 |
| Arc-C 10-Shot | 84.9 | 75.9 | 78.6 | 78.3 | 82.8 | 87.3 | 87.4 |
| Arc-E 10-Shot | 94.6 | 88.5 | 90.6 | 91.4 | 93.4 | 95.6 | 96.3 |
| PIQA 5-Shot | 84.2 | 60.2 | 77.7 | 78.1 | 75.7 | 86.0 | 86.6 |
| SociQA 5-Shot | 76.6 | 68.3 | 74.6 | 65.5 | 73.9 | 75.9 | 68.3 |
| BigBench-Hard 0-Shot | 71.7 | 59.4 | 57.3 | 59.6 | 51.5 | 69.7 | 68.32 |
| WinoGrande 5-Shot | 70.8 | 54.7 | 54.2 | 55.6 | 65.0 | 62.0 | 68.8 |
| OpenBookQA 10-Shot | 83.2 | 73.6 | 79.8 | 78.6 | 82.6 | 85.8 | 86.0 |
| BoolQ 0-Shot | 77.6 | -- | 72.2 | 66.0 | 80.9 | 77.6 | 79.1 |
| CommonSenseQA 10-Shot | 80.2 | 69.3 | 72.6 | 76.2 | 79.0 | 78.1 | 79.6 |
| TruthfulQA 10-Shot | 65.0 | -- | 52.1 | 53.0 | 63.2 | 60.1 | 85.8 |
| HumanEval 0-Shot | 59.1 | 47.0 | 28.0 | 34.1 | 60.4 | 37.8 | 62.2 |
| MBPP 3-Shot | 53.8 | 60.6 | 50.8 | 51.5 | 67.7 | 60.2 | 77.8 |
Model tree for brittlewis12/Phi-3-mini-4k-instruct-GGUF
Base model
microsoft/Phi-3-mini-4k-instruct