Journalists on Hugging Face

community

AI & ML interests

Democratizing access to useful AI tools and resources for journalists

Recent Activity

JournalistsonHF's activity

davanstrien 
posted an update 5 days ago
Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu, the community is labelling the educational quality of texts in many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c
thomwolf 
posted an update 16 days ago
We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the meantime, come ask us questions in our chat space: HuggingFaceFW/discussion

H/t @guipenedo, @hynky, @lvwerra, as well as @vsabolcec, Bettina Messmer, @negar-foroutan, and @mjaggi
  • 2 replies
thomwolf 
posted an update 19 days ago

Best NLP tutorials?

1
#12 opened 20 days ago by
ajwl
ajwl 
in JournalistsonHF/README 20 days ago

fdaudens 
updated a Space 21 days ago
thomwolf 
posted an update 21 days ago
davanstrien 
posted an update 26 days ago
LLMs are becoming increasingly useful for scaling annotation tasks, i.e. labelling and filtering. Combined with structured generation, this can be a very scalable way of doing pre-annotation without requiring a large team of human annotators.

However, there are quite a few cases where it still doesn't work well. This is a nice paper looking at the limitations of LLMs as annotators for low-resource languages: On Limitations of LLM as Annotator for Low Resource Languages (2411.17637).

Humans will still have an important role in the loop to help improve models for all languages (and domains).
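
As a rough illustration of the pre-annotation idea above, here is a minimal sketch, not the method from the post or the paper. It assumes the model returns JSON with a `label` and a `confidence` field and checks that output after the fact (real structured generation would constrain decoding instead). The label set, the threshold, and `fake_model` are all hypothetical stand-ins.

```python
import json
from typing import Callable, List, Optional, Tuple

# Hypothetical label set, chosen only for illustration.
ALLOWED_LABELS = {"keep", "discard"}


def parse_annotation(raw: str) -> Optional[dict]:
    """Return the annotation if it matches the expected schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    confidence = data.get("confidence")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return None
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return {"label": label, "confidence": float(confidence)}


def pre_annotate(
    texts: List[str],
    model: Callable[[str], str],
    min_confidence: float = 0.8,  # assumed threshold, tune per task
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split texts into auto-labelled pairs and texts that need a human."""
    auto, needs_human = [], []
    for text in texts:
        annotation = parse_annotation(model(text))
        if annotation is not None and annotation["confidence"] >= min_confidence:
            auto.append((text, annotation["label"]))
        else:
            # Malformed or low-confidence output goes to human annotators.
            needs_human.append(text)
    return auto, needs_human


def fake_model(text: str) -> str:
    """Stand-in for a real LLM call; returns canned responses."""
    if "recipe" in text:
        return json.dumps({"label": "keep", "confidence": 0.95})
    if "spam" in text:
        return json.dumps({"label": "discard", "confidence": 0.55})
    return "not valid json"


auto, needs_human = pre_annotate(["a recipe", "spam here", "other"], fake_model)
```

Anything the model gets wrong in format, or is unsure about, lands in `needs_human`, which is one simple way to keep people in the loop.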
davanstrien 
posted an update 29 days ago
First dataset for the new Hugging Face Bluesky community organisation: bluesky-community/one-million-bluesky-posts 🦋

📊 1M public posts from Bluesky's firehose API
🔍 Includes text, metadata, and language predictions
🔬 Perfect to experiment with using ML for Bluesky 🤗

Excited to see people build more open tools for a more open social media platform!
davanstrien 
posted an update 30 days ago
The Bluesky AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky
To gather Bluesky resources on the Hub, I've created a community org: https://huggingface.co./bluesky-community

My first rather modest contribution is a dashboard that shows the number of posts every second. Drinking straight from the firehose API 🚰

bluesky-community/bluesky-posts-over-time
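
The posts-per-second idea can be sketched offline as follows. This is only an illustration under stated assumptions: it does not connect to the firehose or reproduce the dashboard's code, and it assumes each post carries an ISO 8601 creation timestamp. The sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable


def posts_per_second(timestamps: Iterable[str]) -> Dict[datetime, int]:
    """Count posts in each one-second bucket, keyed by the bucket start."""
    counts: Counter = Counter()
    for ts in timestamps:
        moment = datetime.fromisoformat(ts)
        # Drop sub-second precision so posts in the same second share a key.
        counts[moment.replace(microsecond=0)] += 1
    return dict(sorted(counts.items()))


# Made-up sample timestamps, not real Bluesky data.
sample = [
    "2024-11-27T12:00:00.250000+00:00",
    "2024-11-27T12:00:00.900000+00:00",
    "2024-11-27T12:00:01.100000+00:00",
]

for second, count in posts_per_second(sample).items():
    print(second.isoformat(), count)
```

A live version would feed timestamps into the same counter as events arrive and plot the most recent buckets.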
  • 1 reply
thomwolf 
posted an update about 1 month ago
BrigitteTousi 
posted an update about 1 month ago
davanstrien 
posted an update about 1 month ago
thomwolf 
posted an update about 1 month ago
fdaudens 
updated a Space about 2 months ago
ctbritt 
in JournalistsonHF/README about 2 months ago
davanstrien 
posted an update about 2 months ago
thomwolf 
posted an update 2 months ago
Parents in the 1990s: Teach the kids to code
Parents now: Teach the kids to fix the code when it starts walking around 🤖✨
  • 2 replies
erinys 
posted an update 2 months ago
davanstrien 
posted an update 3 months ago