Anton Lozhkov

anton-l

AI & ML interests

Generative Models, Distributed Training, Photo and Video Enhancement

Articles

SmolLM - blazingly fast and remarkably powerful

Jul 16

• 292

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Mar 20

• 69

anton-l's activity

New activity in HuggingFaceTB/finemath about 6 hours ago

Create ？

#4 opened about 9 hours ago by

Amyww

New activity in HuggingFaceTB/finemath 2 days ago

[bot] Conversion to Parquet

#1 opened 6 days ago by

parquet-converter

updated a dataset 2 days ago

HuggingFaceTB/math_tasks

Viewer • Updated 2 days ago • 21.3k • 61 • 1

New activity in HuggingFaceTB/finemath 2 days ago

[Bug] cannot get prompts

#2 opened 3 days ago by

BigDong

updated a dataset 2 days ago

HuggingFaceTB/finemath

Viewer • Updated 2 days ago • 48.3M • 9.76k • 160

updated a Space 5 days ago

HuggingFaceFW/fineweb-edu

Viewer • Updated 5 days ago • 3B • 329k • 571

liked a dataset 6 days ago

HuggingFaceTB/finemath

Viewer • Updated 2 days ago • 48.3M • 9.76k • 160

posted an update 6 days ago

Introducing 📐𝐅𝐢𝐧𝐞𝐌𝐚𝐭𝐡: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs, and training on FineMath yields considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
🛠️ carefully extracting math data from Common Crawl;
🔎 iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.
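
The filtering step above can be illustrated with a rough Python sketch. This is not the FineMath pipeline: `keyword_score` is a hypothetical stand-in for the trained classifier, and the 0 to 5 score scale and the threshold of 3.0 are assumptions made only for the example.

```python
from typing import Callable, Iterable, Iterator


def keyword_score(text: str) -> float:
    """Hypothetical stand-in for a learned classifier.

    Counts how many math cue strings appear in the text and scales
    the result to an assumed 0-5 range.
    """
    cues = ("theorem", "proof", "equation", "solve", "therefore", "=")
    lowered = text.lower()
    hits = sum(1 for cue in cues if cue in lowered)
    return 5.0 * hits / len(cues)


def filter_pages(
    pages: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 3.0,  # assumed cut-off, not taken from the post
) -> Iterator[str]:
    """Yield only the pages whose score reaches the threshold."""
    for page in pages:
        if score_fn(page) >= threshold:
            yield page


pages = [
    "Solve the equation 2x + 3 = 7. Therefore x = 2.",
    "Top ten holiday destinations for this summer.",
]
kept = list(filter_pages(pages, keyword_score))
```

In a real setting the scoring function would be a model trained on annotated pages, and the filter would likely be rerun as the classifier improves, which is one way to read "iteratively" in the description above.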

We ran a series of ablations on Llama-3.2-3B-Base with continued pre-training on FineMath and observed notable gains over both the baseline model and other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! 🚀
We’re also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2