Journalists on Hugging Face

AI & ML interests

Democratizing access to useful AI tools and resources for journalists

Recent Activity

fdaudens new activity 20 days ago

JournalistsonHF/README:Best NLP tutorials?

ajwl new activity 20 days ago

JournalistsonHF/README:Best NLP tutorials?

fdaudens updated a Space 21 days ago

JournalistsonHF/README

JournalistsonHF's activity

davanstrien

posted an update 5 days ago

Post

1512

Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu, the community is labelling the educational quality of texts in many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c

thomwolf

posted an update 16 days ago

Post

4333

We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️ languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the meantime, come ask us questions in our chat space: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi

2 replies

thomwolf

posted an update 19 days ago

Post

886

The number of open-source AI models has grown exponentially over the past 30 months – from a few thousand to over 1 million

Interactive data viz: huggingface/open-source-ai-year-in-review-2024

fdaudens

in JournalistsonHF/README 20 days ago

Best NLP tutorials?

#12 opened 20 days ago by

ajwl

in JournalistsonHF/README 20 days ago

Best NLP tutorials?

#12 opened 20 days ago by

ajwl

fdaudens

updated a Space 21 days ago

Running

😻

README

thomwolf

posted an update 21 days ago

Post

1356

Most liked and most downloaded open-source AI models from 2022 to 2024

Interactive viz: https://aiworld.eu/embed/model/model/treemap
Discussion: huggingface/open-source-ai-year-in-review-2024

davanstrien

posted an update 26 days ago

Post

486

Increasingly, LLMs are becoming very useful for helping scale annotation tasks, i.e. labelling and filtering. When combined with structured generation, this can be a very scalable way of doing some pre-annotation without requiring a large team of human annotators.

However, there are quite a few cases where it still doesn't work well. This is a nice paper looking at the limitations of LLMs as annotators for low-resource languages: On Limitations of LLM as Annotator for Low Resource Languages (2411.17637).

Humans will still have an important role in the loop to help improve models for all languages (and domains).
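
A minimal sketch of the pre-annotation idea described in this post, assuming a hypothetical LLM that returns one JSON object per text with a `label` field. The label set, the helper names, and the sample responses are all illustrative assumptions and do not come from any specific library or from the paper. Anything that fails validation is routed to human annotators, which is one simple way to keep humans in the loop.

```python
import json

# Hypothetical label set for an educational-quality annotation task.
ALLOWED_LABELS = {"none", "minimal", "basic", "good", "excellent"}


def validate_annotation(raw_response):
    """Return the label if the response is valid JSON with an allowed label, else None."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    # Only accept string labels from the fixed set.
    if isinstance(label, str) and label in ALLOWED_LABELS:
        return label
    return None


def split_for_review(responses):
    """Separate machine-accepted annotations from items that need human review."""
    accepted, needs_human = [], []
    for idx, raw in enumerate(responses):
        label = validate_annotation(raw)
        if label is None:
            needs_human.append(idx)
        else:
            accepted.append((idx, label))
    return accepted, needs_human


# Made-up model outputs, for illustration only.
sample_responses = ['{"label": "good"}', '{"label": "amazing"}', "not json"]
accepted, needs_human = split_for_review(sample_responses)
```

Here only the first response is accepted; the other two are flagged for a human to label.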

davanstrien

posted an update 29 days ago

Post

2471

First dataset for the new Hugging Face Bluesky community organisation: bluesky-community/one-million-bluesky-posts 🦋

📊 1M public posts from Bluesky's firehose API
🔍 Includes text, metadata, and language predictions
🔬 Perfect to experiment with using ML for Bluesky 🤗

Excited to see people build more open tools for a more open social media platform!

davanstrien

posted an update 30 days ago

Post

1348

The Bluesky AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky
To gather Bluesky resources on the Hub, I've created a community org: https://huggingface.co./bluesky-community

My first rather modest contribution is a dashboard that shows the number of posts every second. Drinking straight from the firehose API 🚰

bluesky-community/bluesky-posts-over-time
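
As a rough sketch of the counting behind a posts-per-second view: the snippet below buckets post timestamps by second. It assumes each post carries an ISO 8601 creation time with an explicit UTC offset; the timestamp format, the `posts_per_second` helper, and the sample values are illustrative assumptions, not the dashboard's actual code or the firehose's actual event shape.

```python
from collections import Counter
from datetime import datetime


def posts_per_second(created_at_values):
    """Count posts in each one-second bucket, keyed by the truncated ISO timestamp."""
    counts = Counter()
    for value in created_at_values:
        dt = datetime.fromisoformat(value)
        # Drop the sub-second part so posts in the same second share a key.
        bucket = dt.replace(microsecond=0)
        counts[bucket.isoformat()] += 1
    return dict(counts)


# Made-up timestamps, for illustration only.
sample_timestamps = [
    "2024-12-01T12:00:00.250+00:00",
    "2024-12-01T12:00:00.900+00:00",
    "2024-12-01T12:00:01.100+00:00",
]
per_second = posts_per_second(sample_timestamps)
```

A live dashboard would apply the same bucketing to a continuous stream rather than a fixed list.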

1 reply

thomwolf

posted an update about 1 month ago

Post

1639

Interesting long read from @evanmiller-anthropic on taking a better-founded statistical approach to language model evaluations:
https://www.anthropic.com/research/statistical-approach-to-model-evals

Worth a read if you're into LLM evaluations!

Cc @clefourrier

1 reply

BrigitteTousi

posted an update about 1 month ago

Post

929

I'm biased but I think HF Posts is the #1 social platform for the AI community! 🤗 That being said, most of us are already on X and now also joining Bluesky.

Looking for us on Bsky? We started a team list here: https://bsky.app/starter-pack/did:plc:yyfrnpcutxghwc6eac4xplwp/3lbem54cnxp26

davanstrien

posted an update about 1 month ago

Post

1306

huggingface.co/DIBT is dead!

Long live https://huggingface.co./data-is-better-together!

We're working on some very cool projects so we're doing a bit of tidying of the Data is Better Together Hub org 🤓

thomwolf

posted an update about 1 month ago

Post

1413

Very exciting new mistralai/Pixtral-Large-Instruct-2411 model from Mistral-AI

Impressive performance, huge congrats @patrickvonplaten @sgvaze @pandora-s @devendrachaplot @sophiamyang and team!

Very nice to have SOTA Multilingual OCR and Chart understanding in an open-weights model

fdaudens

updated a Space about 2 months ago

Running

📉

Ai Scraper

ctbritt

in JournalistsonHF/README about 2 months ago

Hi! Introduce yourself! 👋

#2 opened 8 months ago by

fdaudens

davanstrien

posted an update about 2 months ago

Post

2527

Excited to see my weird davanstrien/ufo-ColPali dataset featured in a video by @sabrinaesaquino !

The video covers using ColPali with Binary Quantization in Qdrant to accelerate retrieval. 2x speed up with no performance drop in results 🛸

Video: https://youtu.be/_A90A-grwIc?si=oB3JAhJG8VQUZGLz
Blog post: https://danielvanstrien.xyz/posts/post-with-code/colpali-qdrant/2024-10-02_using_colpali_with_qdrant.html
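
For readers unfamiliar with binary quantization, here is a generic sketch of the idea: each embedding dimension is reduced to a single bit (1 if the value is positive, 0 otherwise), and vectors are compared by Hamming distance. This is a toy illustration under those assumptions, not Qdrant's implementation; the `binarize` and `hamming_distance` helpers and the 8-dimensional vectors are made up for the example.

```python
def binarize(vector):
    """Pack a float vector into an int, one bit per dimension (1 if value > 0)."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two packed vectors."""
    return bin(a ^ b).count("1")


# Toy 8-dimensional embeddings, made up for illustration.
query = [0.9, -0.2, 0.4, -0.7, 0.1, 0.3, -0.5, 0.8]
docs = {
    "doc_a": [0.8, -0.1, 0.5, -0.6, 0.2, 0.1, -0.4, 0.7],
    "doc_b": [-0.9, 0.3, -0.4, 0.6, -0.2, -0.3, 0.5, -0.8],
}

query_bits = binarize(query)
# Rank documents by how few bits differ from the query.
ranked = sorted(
    docs, key=lambda name: hamming_distance(query_bits, binarize(docs[name]))
)
```

Because each dimension takes one bit instead of a 32-bit float, the quantized vectors are much smaller and comparisons reduce to XOR plus a bit count, which is where the speed-up comes from.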

2 replies

thomwolf

posted an update 2 months ago

Post

4114

Parents in the 1990s: Teach the kids to code
Parents now: Teach the kids to fix the code when it starts walking around 🤖✨

2 replies

erinys

posted an update 2 months ago

Post

2151

🌍 Super cool visualization of global PUT requests to Hugging Face over 24 hours, coded by object size, thanks to @port8080 !

We're putting this analysis to work to help us architect a more geo-distributed system for the HF storage backend.

Originally shared on LinkedIn: https://www.linkedin.com/posts/ajitbanerjee_one-of-the-joys-of-working-on-the-xethub-activity-7252688424732614656-tFGD

davanstrien

posted an update 3 months ago

Post

1245

ColPali is an exciting new approach to multimodal document retrieval, but some doubt its practical use with existing vector DBs.

It turns out it's super easy to use Qdrant to index and search ColPali embeddings efficiently.

Blog post here: https://danielvanstrien.xyz/posts/post-with-code/colpali-qdrant/2024-10-02_using_colpali_with_qdrant.html

Very silly demo: davanstrien/ufo-ColPali-Search

Team members 325