blog-explorers (Blog-explorers)

posted an update about 14 hours ago

Post

729

🌍 Big step for multilingual AI data!

The Hugging Face community has rated educational content in languages spoken by 1.6 billion people! New additions:
• Japanese
• Italian
• Old High German

Learn more and contribute: https://huggingface.co./blog/davanstrien/fineweb2-community

These ratings can help enhance training data for major world languages.

davidberenstein1957

posted an update about 18 hours ago

Post

919

Let's uncover the post-training dataset from DeepSeek-R1 with Magpie!

Pass the pre-query tokens <｜begin▁of▁sentence｜>User: and let the model generate the rest.

We can get realistic examples!

Gist: https://gist.github.com/davidberenstein1957/3f20046ce57395a6aba13f8b4e956b59
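The pre-query trick can be sketched roughly as follows. This is an illustration only, not the code in the gist: `complete` is a hypothetical callback standing in for whatever raw text-completion endpoint serves the model, and the stop markers are assumptions about where the generated user turn ends.

```javascript
// Pre-query prefix quoted in the post: the BOS token plus the user-turn header.
const PRE_QUERY = "<｜begin▁of▁sentence｜>User: ";

// Assumed markers for the end of the generated user turn (not from the post).
const DEFAULT_STOPS = ["\n\nAssistant:", "<｜end▁of▁sentence｜>"];

// Cut the completion at the earliest stop marker and trim whitespace.
function extractInstruction(completion, stops = DEFAULT_STOPS) {
  let end = completion.length;
  for (const stop of stops) {
    const idx = completion.indexOf(stop);
    if (idx !== -1 && idx < end) end = idx;
  }
  return completion.slice(0, end).trim();
}

// Sample `n` instructions. `complete` is a hypothetical function that takes a
// raw prompt string (no chat template applied) and returns the continuation.
function sampleInstructions(complete, n, stops = DEFAULT_STOPS) {
  const instructions = [];
  for (let i = 0; i < n; i++) {
    const instruction = extractInstruction(complete(PRE_QUERY), stops);
    if (instruction.length > 0) instructions.push(instruction);
  }
  return instructions;
}
```

A real completion call would be asynchronous and sampled with a non-zero temperature so that repeated calls yield different instructions; the synchronous callback here just keeps the sketch short.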

5 replies

·

mrfakename

posted an update 4 days ago

Post

671

I’m excited to introduce a new leaderboard UI + keyboard shortcuts on the TTS Arena!

The refreshed UI for the leaderboard is smoother and (hopefully) more intuitive. You can now view models based on a simpler win-rate percentage and exclude closed models.
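As a rough sketch of what a win-rate view with a closed-model filter could look like: the post does not give the Arena's actual formula, so the `wins`/`losses`/`open` record shape and the "wins divided by decided votes" definition below are assumptions.

```javascript
// Hypothetical record shape: { name, wins, losses, open }.
// Win rate as a percentage of decided votes; 0 when a model has no votes yet.
function winRate(model) {
  const total = model.wins + model.losses;
  return total === 0 ? 0 : (100 * model.wins) / total;
}

// Rank models by win rate, optionally hiding closed models.
function leaderboard(models, { excludeClosed = false } = {}) {
  return models
    .filter((m) => !excludeClosed || m.open)
    .map((m) => ({ name: m.name, winRate: winRate(m) }))
    .sort((a, b) => b.winRate - a.winRate);
}
```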

In addition, the TTS Arena now supports keyboard shortcuts. This should make voting much more efficient as you can now vote without clicking anything!

In both the normal Arena and Battle Mode, press "r" to select a random text, Cmd/Ctrl + Enter to synthesize, and "a"/"b" to vote! View more details about keyboard shortcuts by pressing "?" (Shift + /) on the Arena.
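The shortcut scheme described above could be dispatched along these lines. This is a sketch, not the Arena's code: the action names are made up for illustration, and the event is assumed to look like a browser `KeyboardEvent` (`key`, `ctrlKey`, `metaKey`).

```javascript
// Map a keyboard event to a (hypothetical) action name, or null if unbound.
function resolveShortcut(event) {
  const { key, ctrlKey = false, metaKey = false } = event;
  // Cmd/Ctrl + Enter synthesizes.
  if (key === "Enter" && (ctrlKey || metaKey)) return "synthesize";
  // Leave other modifier combinations to the browser.
  if (ctrlKey || metaKey) return null;
  switch (key.toLowerCase()) {
    case "r":
      return "random-text";
    case "a":
      return "vote-a";
    case "b":
      return "vote-b";
    case "?": // Shift + / on most layouts
      return "show-help";
    default:
      return null;
  }
}
```

In a page this would typically be called from a `keydown` listener, skipping the letter shortcuts while a text field has focus so that typing "a" or "b" does not cast a vote.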

Check out all the new updates on the TTS Arena:

TTS-AGI/TTS-Arena

davidberenstein1957

posted an update 7 days ago

Post

1835

The RAG's in the bag!

You can now use the Synthetic Data Generator with your own domain-specific seed data to generate a dataset for fine-tuning retrieval or reranking models.

GitHub: https://buff.ly/49IDSmd
Spaces: https://buff.ly/3Y1S99z
Blog: https://huggingface.co./blog/sdiazlor/fine-tune-modernbert-for-rag-with-synthetic-data

1 reply

·

julien-c

in blog-explorers/README 7 days ago

[Support] Community Articles

75

#5 opened 10 months ago by

victor

davidberenstein1957

posted an update 11 days ago

Post

1227

You can now use the "Synthetic Data Generator" at a much larger scale with your preferred inference engine: Ollama, vLLM, TGI, and serverless inference! 🔥

Install, configure, launch!

Space: argilla/synthetic-data-generator
Examples: https://github.com/argilla-io/synthetic-data-generator/tree/main/examples

Xenova

posted an update 11 days ago

Post

3362

Introducing Kokoro.js, a new JavaScript library for running Kokoro TTS, an 82 million parameter text-to-speech model, 100% locally in the browser w/ WASM. Powered by 🤗 Transformers.js. WebGPU support coming soon!
👉 npm i kokoro-js 👈

Try it out yourself: webml-community/kokoro-web
Link to models/samples: onnx-community/Kokoro-82M-ONNX

You can get started in just a few lines of code!

import { KokoroTTS } from "kokoro-js";

const tts = await KokoroTTS.from_pretrained(
  "onnx-community/Kokoro-82M-ONNX",
  { dtype: "q8" }, // fp32, fp16, q8, q4, q4f16
);

const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text,
  { voice: "af_sky" }, // See `tts.list_voices()`
);
audio.save("audio.wav");

Huge kudos to the Kokoro TTS community, especially taylorchu for the ONNX exports and Hexgrad for the amazing project! None of this would be possible without you all! 🤗

The model is also extremely resilient to quantization. The smallest variant is only 86 MB in size (down from the original 326 MB), with no noticeable difference in audio quality! 🤯
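For a sense of scale, the two sizes quoted above work out as follows (plain arithmetic on the figures in the post, nothing measured here):

```javascript
const originalMB = 326; // original model size quoted in the post
const smallestMB = 86;  // smallest quantized variant quoted in the post

const compressionRatio = originalMB / smallestMB;          // how many times smaller
const reductionPct = 100 * (1 - smallestMB / originalMB);  // share of bytes saved

console.log(
  `${compressionRatio.toFixed(1)}x smaller, ${reductionPct.toFixed(1)}% reduction`
);
// prints "3.8x smaller, 73.6% reduction"
```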

4 replies

·

davidberenstein1957

posted an update 14 days ago

Post

2094

🔦 What? The Hub as a vector search backend!

code: https://gist.github.com/davidberenstein1957/f0157a471ec59d9dd44ae6957f1d52ec
built on DuckDB: https://huggingface.co./docs/hub/en/datasets-duckdb
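The linked code runs the search through DuckDB over a Hub dataset. To show only the ranking step that any such backend performs, here is a swapped-in, in-memory brute-force version using cosine similarity; the `{ id, embedding }` row shape is an assumption made for the example.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("vector lengths differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors match nothing
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every row against the query embedding and return the k best matches.
function topK(queryEmbedding, rows, k) {
  return rows
    .map((row) => ({
      id: row.id,
      score: cosineSimilarity(queryEmbedding, row.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

Brute force is linear in the number of rows, which is fine for a sketch but is exactly the part a database engine or index would take over at scale.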

meg

posted an update 14 days ago

Post

2937

💫...And we're live!💫 Seasonal newsletter from ethicsy folks at Hugging Face, exploring the ethics of "AI Agents"
https://huggingface.co./blog/ethics-soc-7
Our analyses found:
- There's a spectrum of "agent"-ness
- *Safety* is a key issue, leading to many other value-based concerns
Read for details & what to do next!
With @evijit , @giadap , and @sasha

davanstrien

posted an update 14 days ago

Post

3032

Introducing scandi-fine-web-cleaner (davanstrien/scandi-fine-web-cleaner), the first model trained on FineWeb-C community annotations!

FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?

Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.

Today, I'm happy to share the first classifier trained on this data.

🔍 What we've built:

- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute
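The filtering step and the precision figure can be illustrated with a small sketch. It makes two assumptions not stated in the post: that the classifier exposes a probability of a document being low quality (the hypothetical `scoreLowQuality` callback), and that precision is measured on the "low quality" class, i.e. how many of the removed documents really were low quality.

```javascript
// Keep documents whose predicted low-quality probability is below the threshold.
// `scoreLowQuality` is a hypothetical stand-in for the trained classifier.
function filterCorpus(docs, scoreLowQuality, threshold = 0.5) {
  return docs.filter((doc) => scoreLowQuality(doc.text) < threshold);
}

// Precision of the removal decision: true positives / all flagged documents.
// `flagged[i]` is the classifier's decision, `isLowQuality[i]` the human label.
function precision(flagged, isLowQuality) {
  if (flagged.length !== isLowQuality.length) throw new Error("length mismatch");
  let truePositives = 0;
  let falsePositives = 0;
  for (let i = 0; i < flagged.length; i++) {
    if (!flagged[i]) continue;
    if (isLowQuality[i]) truePositives++;
    else falsePositives++;
  }
  const total = truePositives + falsePositives;
  return total === 0 ? NaN : truePositives / total;
}
```

High precision on the removal class is what matters for a cleaning filter: it means little good content is thrown away, even if some low-quality documents slip through.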

🌍 Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (data-is-better-together/fineweb-c). By starting with community annotations and training small, efficient classifiers, we can improve training data quality at scale without massive compute resources.

Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html

1 reply

·

davanstrien

posted an update 18 days ago

Post

2200

The data-is-better-together/fineweb-c dataset is growing!

This week a few more languages reached 1,000 annotations for the educational quality of data from HuggingFaceFW/fineweb-2.

Why should you care?

The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data (HuggingFaceFW/blogpost-fineweb-v1).

Being able to filter by educational quality is one way of improving the quality of the data you use for training an LLM. Just as importantly, this approach can also reduce the amount of data needed for pretraining.

Why not use an LLM?

LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder-only model to label the full dataset. However, this may not work well for languages other than English. This is where fineweb-c (community) comes in.

The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:

- Evaluate whether an LLM can label the educational quality for texts in that language well
- Directly be used for training quality classifiers
- Help discover other rules and heuristics for refining fineweb2 further for different languages.
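For the first item in the list, one way to check how well an LLM's labels match the community's is a chance-corrected agreement score. The sketch below uses Cohen's kappa; that metric is my choice for illustration, not something the post prescribes, and the label values are hypothetical.

```javascript
// Cohen's kappa between two label sequences of equal length.
// kappa = (observed agreement - chance agreement) / (1 - chance agreement)
function cohensKappa(labelsA, labelsB) {
  if (labelsA.length !== labelsB.length || labelsA.length === 0) {
    throw new Error("label arrays must be non-empty and of equal length");
  }
  const n = labelsA.length;
  const countsA = new Map();
  const countsB = new Map();
  let agreements = 0;
  for (let i = 0; i < n; i++) {
    if (labelsA[i] === labelsB[i]) agreements++;
    countsA.set(labelsA[i], (countsA.get(labelsA[i]) || 0) + 1);
    countsB.set(labelsB[i], (countsB.get(labelsB[i]) || 0) + 1);
  }
  const observed = agreements / n;
  // Chance agreement: probability both raters pick the same label independently.
  let expected = 0;
  for (const [label, countA] of countsA) {
    expected += (countA / n) * ((countsB.get(label) || 0) / n);
  }
  if (expected === 1) return 1; // both raters always use the same single label
  return (observed - expected) / (1 - expected);
}
```

A kappa near 1 would suggest the LLM can stand in for human annotators in that language; a value near 0 means it agrees no more often than chance.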

This week the following languages were completed:

Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod

Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate

Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap

Want to learn more? https://huggingface.co./blog/davanstrien/fineweb2-community

Contribute yourself here: data-is-better-together/fineweb-c

1 reply

·

lunarflu

in blog-explorers/README 21 days ago

[Support] Community Articles

75

#5 opened 10 months ago by

victor

MaziyarPanahi

in blog-explorers/README 21 days ago

[Support] Community Articles

75

#5 opened 10 months ago by

victor

akkasayaz

in blog-explorers/README 22 days ago

[Support] Community Articles

75

#5 opened 10 months ago by

victor

davidberenstein1957

posted an update 24 days ago

Post

1939

Fine-tune a SmolLM on domain-specific synthetic data from an LLM

Blog: https://huggingface.co./blog/davidberenstein1957/fine-tune-a-smollm-on-synthetic-data-of-llm

1 reply

·

wolfram

in blog-explorers/README 25 days ago

[Support] Community Articles

75

#5 opened 10 months ago by

victor

Xenova

posted an update 26 days ago

Post

7310

First project of 2025: Vision Transformer Explorer

I built a web app to interactively explore the self-attention maps produced by ViTs. This explains what the model is focusing on when making predictions, and provides insights into its inner workings! 🤯

Try it out yourself! 👇
webml-community/attention-visualization

Source code: https://github.com/huggingface/transformers.js-examples/tree/main/attention-visualization
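For readers unfamiliar with what an attention map is, here is a stripped-down sketch of the computation for a single head: each query vector is compared with every key vector, the scores are scaled by the square root of the dimension, and a softmax turns each row into weights that sum to 1. The app reads these weights from a real model; the toy vectors and the absence of learned projections below are simplifications.

```javascript
// Numerically stable softmax over an array of scores.
function softmax(scores) {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const sum = exps.reduce((acc, e) => acc + e, 0);
  return exps.map((e) => e / sum);
}

// Dot product of two equal-length vectors.
function dot(a, b) {
  let total = 0;
  for (let i = 0; i < a.length; i++) total += a[i] * b[i];
  return total;
}

// Attention map for one head: row i holds how much query i attends to each key.
// weights[i][j] = softmax_j( q_i . k_j / sqrt(d) )
function attentionMap(queries, keys) {
  const d = keys[0].length;
  return queries.map((q) => softmax(keys.map((k) => dot(q, k) / Math.sqrt(d))));
}
```

In a ViT each query and key corresponds to an image patch (plus a class token), so one row of this matrix, reshaped to the patch grid, is the kind of heatmap such a visualization displays.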

davidberenstein1957

posted an update 29 days ago

Post

2001

Fine-tuning ModernBERT for text classification using synthetic data generation

From prompt to model in 3 steps.
1 dataset description
20 minutes of generating
60 minutes of fine-tuning on my MacBook Pro

Tutorial: https://nbsanity.com/static/552eb50cbd91bedb4e5b73fddca2664a/fine-tune-modernbert-classifier.html

davanstrien

posted an update about 1 month ago

Post

3192

🇸🇰 Hovorte po slovensky? Help build better AI for Slovak!

We only need 90 more annotations to include Slovak in the next Hugging Face FineWeb2-C dataset (data-is-better-together/fineweb-c) release!

Your contribution will help create better language models for 5+ million Slovak speakers.

Annotate here: data-is-better-together/fineweb-c.

Read more about why we're doing it: https://huggingface.co./blog/davanstrien/fineweb2-community

3 replies

·

davanstrien

posted an update about 1 month ago

Post

1787

Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu, the community is labelling the educational quality of texts for many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c

Blog-explorers

AI & ML interests

Recent Activity

Team members 673
