Elie Bakouch

eliebak

AI & ML interests

Training LLM's @ 🤗

Organizations

Hugging Face, HuggingFaceBR4, Hugging Face H4, Blog-explorers, Hugging Face TB Research, huggingPartyParis, Nanotron Research, Hugging Face SMOL, MLX Community, HuggingFaceFW, LLHF, llmc, SLLHF, Argilla Warehouse, nltpt, smol-explorers, Open Science, Hugging Face Science, open/ acc

eliebak's activity

upvoted an article 1 day ago

🌁#81: Key AI Concepts to Follow in 2025

By Kseniase • 13
reacted to anton-l's post with 🔥 6 days ago
Introducing 📐 FineMath: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
πŸ› οΈ carefully extracting math data from Common Crawl;
πŸ”Ž iteratively filtering and recalling high quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.

We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains compared to the baseline model and other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! 🚀
We're also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2