allenai/ZebraLogic · Reproducing ZebraLogic results

I've been struggling to reproduce the results in ZeroEval/result_dir/zebra-grid.summary.md. The only intended difference in my configuration is using the HuggingFace engine instead of vLLM (the two commands below also differ in the -s value). Since the temperature is set to 0.0, I can't see where a difference in results would come from. If I'm making an obvious mistake, I'd be grateful to know!

My command:

bash zero_eval_local.sh -d zebra-grid -m Qwen/Qwen2-7B-Instruct -p Qwen2-7B-Instruct -s 2 -f hf

Reference command from ZeroEval/scripts/_ZebraLogic.md:

bash zero_eval_local.sh -d zebra-grid -m Qwen/Qwen2-7B-Instruct -p Qwen2-7B-Instruct -s 4

| Model | Mode | N_Mode | N_Size | Puzzle Acc | Small Puzzle Acc | Medium Puzzle Acc | Large Puzzle Acc | XL Puzzle Acc | Cell Acc | No Answer | Total Puzzles | Reason Lens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2-7B-Instruct (allenai) | greedy | single | 1 | 8.4 | 26.25 | 0 | 0 | 0 | 22.06 | 24.4 | 1000 | 1473.23 |
| Qwen2-7B-Instruct (myself) | greedy | single | 1 | 7.3 | 22.5 | 0.36 | 0 | 0 | 22.52 | 24.5 | 1000 | 1504.05 |
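
As a rough sanity check on the size of the gap, here is a minimal sketch that computes the per-metric differences between the two rows and a binomial standard error for Puzzle Acc. The numbers are copied from the table; the dictionary keys and helper names are made up for illustration. Treating the 1000 puzzles as independent trials is an assumption, and because both runs score the same puzzles, this gives only a sense of scale, not a proper paired test.

```python
import math

# Values copied from the table above (percentages; 1000 puzzles per run).
N = 1000
reference = {"puzzle_acc": 8.4, "small_acc": 26.25, "cell_acc": 22.06, "no_answer": 24.4}
mine = {"puzzle_acc": 7.3, "small_acc": 22.5, "cell_acc": 22.52, "no_answer": 24.5}


def deltas(a, b):
    """Return b - a for every metric, rounded to 2 decimals."""
    return {k: round(b[k] - a[k], 2) for k in a}


def binomial_se_pct(acc_pct, n):
    """Standard error (in percentage points) of an accuracy measured on n
    items, assuming independent trials (an assumption, not a given)."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


diff = deltas(reference, mine)
se = binomial_se_pct(reference["puzzle_acc"], N)

print(diff)
print(f"Puzzle Acc standard error: {se:.2f} points")
```

Under these assumptions the Puzzle Acc gap of 1.1 points is a bit over one standard error (about 0.88 points), so the overall numbers are close even though they are not identical. The helper is applied only to Puzzle Acc because the other accuracy columns are computed over subsets whose sizes are not given in the table.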