Hugging Face TB Research

AI & ML interests

Exploring smol models and high-quality web and synthetic datasets, generated by LLMs (TB is for Textbook, as inspired by the "Textbooks Are All You Need" paper)

HuggingFaceTB's activity

anton-l

in HuggingFaceTB/finemath about 2 hours ago

Create ？

#4 opened about 6 hours ago by

Amyww

merve

posted an update about 23 hours ago

QwQ can see 🔥
Qwen team released QvQ, a large vision LM with reasoning 😱

it outperforms proprietary VLMs on several benchmarks, comes with open weights and a demo!
Check them out ⬇️
Demo Qwen/QVQ-72B-preview
Model Qwen/QVQ-72B-Preview
Read more https://qwenlm.github.io/blog/qvq-72b-preview/
Congratulations @JustinLin610 and team!

anton-l

in HuggingFaceTB/finemath 2 days ago

[bot] Conversion to Parquet

#1 opened 6 days ago by

parquet-converter

anton-l

updated a dataset 2 days ago

HuggingFaceTB/math_tasks

Viewer • Updated 2 days ago • 21.3k • 61 • 1

loubnabnl

in HuggingFaceTB/finemath 2 days ago

Why did you use CC rather than FineWeb to create FineMath?

#3 opened 2 days ago by

CryptAL

anton-l

in HuggingFaceTB/finemath 2 days ago

[Bug] cannot get prompts

#2 opened 3 days ago by

BigDong

anton-l

updated a dataset 2 days ago

HuggingFaceTB/finemath

Viewer • Updated 2 days ago • 48.3M • 9.76k • 156

Xenova

in HuggingFaceTB/SmolLM-1.7B 4 days ago

onnx model has additional unknown input

#7 opened 4 days ago by

SantoshHF

davanstrien

posted an update 5 days ago

Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu, the community is labelling the educational quality of texts for many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c

loubnabnl

updated a Space 5 days ago

Running

👁

README

anton-l

updated a Space 5 days ago

Running

👁

README

anton-l

posted an update 6 days ago

Introducing 📐 FineMath: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
🛠️ carefully extracting math data from Common Crawl;
🔎 iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.

We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains compared to the baseline model and other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! 🚀
We’re also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2