Reproducing ZebraLogic results

Discussion #4, opened by js2042

I've been struggling to reproduce the results in ZeroEval/result_dir/zebra-grid.summary.md. Apart from the -s value (2 in my run, 4 in the reference command), the only configuration difference is that I use the HuggingFace engine (-f hf) instead of vLLM. Since the temperature is set to 0.0, I can't see where a difference in results could come from. If I'm making an obvious mistake, I'd be grateful to know!

My command:

bash zero_eval_local.sh -d zebra-grid -m Qwen/Qwen2-7B-Instruct -p Qwen2-7B-Instruct -s 2 -f hf 

Reference command from ZeroEval/scripts/_ZebraLogic.md:

bash zero_eval_local.sh -d zebra-grid -m Qwen/Qwen2-7B-Instruct -p Qwen2-7B-Instruct -s 4 

| Model | Mode | N_Mode | N_Size | Puzzle Acc | Small Puzzle Acc | Medium Puzzle Acc | Large Puzzle Acc | XL Puzzle Acc | Cell Acc | No Answer | Total Puzzles | Reason Lens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2-7B-Instruct (allenai) | greedy | single | 1 | 8.4 | 26.25 | 0 | 0 | 0 | 22.06 | 24.4 | 1000 | 1473.23 |
| Qwen2-7B-Instruct (myself) | greedy | single | 1 | 7.3 | 22.5 | 0.36 | 0 | 0 | 22.52 | 24.5 | 1000 | 1504.05 |
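To put a number on the gap, here is a small Python sketch that diffs the two result rows column by column. The values are copied from the results table; COLUMNS, diff_rows and the variable names are only illustrative and not part of ZeroEval. Running it shows the largest drop is in Small Puzzle Acc (-3.75), while Cell Acc is actually slightly higher in my run (+0.46).

```python
# Values copied from the results table. Mode/N_Mode/N_Size and Total Puzzles
# are identical in both rows, so they are left out of the comparison.
COLUMNS = ["Puzzle Acc", "Small Puzzle Acc", "Medium Puzzle Acc",
           "Large Puzzle Acc", "XL Puzzle Acc", "Cell Acc",
           "No Answer", "Reason Lens"]

reference = [8.4, 26.25, 0, 0, 0, 22.06, 24.4, 1473.23]  # allenai (vLLM, -s 4)
mine = [7.3, 22.5, 0.36, 0, 0, 22.52, 24.5, 1504.05]     # myself (-f hf, -s 2)


def diff_rows(ref, new):
    """Return {column: new - ref}, rounded to 2 decimals to hide float noise."""
    return {col: round(n - r, 2) for col, r, n in zip(COLUMNS, ref, new)}


if __name__ == "__main__":
    for col, delta in diff_rows(reference, mine).items():
        print(f"{col:>18}: {delta:+.2f}")
```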
