
Csaba Kecskemeti PRO

csabakecskemeti

AI & ML interests

None yet

Recent Activity

updated a model 20 minutes ago
DevQuasar/deepseek-ai.DeepSeek-R1-Zero-bf16
updated a model about 2 hours ago
DevQuasar/Qwen.Qwen2.5-14B-Instruct-1M-GGUF
updated a model about 8 hours ago
DevQuasar/Qwen.Qwen2.5-7B-Instruct-1M-GGUF

Organizations

Zillow, DevQuasar, Hugging Face Party @ PyTorch Conference, Intelligent Estate, open/ acc

csabakecskemeti's activity

replied to their post about 22 hours ago

I've rerun hellaswag with the suggested config; the results haven't improved:

Tasks      Version  Filter  n-shot  Metric        Value     Stderr
hellaswag  1        none    0       acc       ↑   0.5559  ± 0.0050
                    none    0       acc_norm  ↑   0.7436  ± 0.0044

command:
accelerate launch -m lm_eval --model hf --model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype="float16" --tasks hellaswag --batch_size auto:4 --log_samples --output_path eval_results --gen_kwargs temperature=0.6,top_p=0.95,generate_until=64,do_sample=True
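For reference, here is a minimal sketch (not from the original post) of the same hellaswag run through the harness's Python API instead of the CLI, assuming lm-evaluation-harness >= 0.4, where simple_evaluate accepts model_args and gen_kwargs as strings; treat it as illustrative rather than an exact equivalent of the command above:

import lm_eval  # lm-evaluation-harness

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype=float16",
    tasks=["hellaswag"],
    batch_size="auto",
    gen_kwargs="temperature=0.6,top_p=0.95,do_sample=True",
)
# headline metrics (acc, acc_norm) for the task
print(results["results"]["hellaswag"])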

replied to their post 1 day ago

I missed this suggested configuration in the model card:
"For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1."

Thanks to @shb777 and @bin110 for pointing this out!
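As an aside, the pass@1 estimate the quoted config refers to is simply the per-query success rate averaged over queries. A tiny illustrative sketch (the function and variable names here are made up, not from the model card):

from statistics import mean

def pass_at_1(n_correct_per_query, n_samples=64):
    # pass@1 estimated from n_samples responses per query:
    # the fraction of correct responses, averaged over queries
    return mean(c / n_samples for c in n_correct_per_query)

# e.g. three queries with 40, 12 and 64 correct responses out of 64 each
print(pass_at_1([40, 12, 64]))  # -> ~0.604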

replied to their post 1 day ago
replied to their post 1 day ago
posted an update 1 day ago
I've run the Open LLM Leaderboard evaluations + hellaswag on deepseek-ai/DeepSeek-R1-Distill-Llama-8B and compared them to meta-llama/Llama-3.1-8B-Instruct; at first glance R1 does not beat Llama overall.

If anyone wants to double-check, the results are posted here:
https://github.com/csabakecskemeti/lm_eval_results

Did I make some mistake, or is this distilled version (at least) just not better than the competition?

I'll run the same on the Qwen 7B distilled version too.
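For anyone who wants to diff the two runs programmatically, a rough sketch that loads two lm_eval results JSON files and prints the shared metrics side by side. The file names are hypothetical, and the exact metric keys (e.g. "acc_norm,none") can vary by harness version, so treat this as illustrative only:

import json

def load_results(path):
    with open(path) as f:
        return json.load(f)["results"]  # task -> {metric: value}

r1 = load_results("deepseek_r1_distill_llama_8b.json")  # hypothetical file name
llama = load_results("llama_3.1_8b_instruct.json")      # hypothetical file name

for task in sorted(set(r1) & set(llama)):
    for metric, value in r1[task].items():
        other = llama[task].get(metric)
        if isinstance(value, float) and isinstance(other, float):
            print(f"{task:<20} {metric:<20} R1: {value:.4f}  Llama: {other:.4f}")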