Data Is Better Together


AI & ML interests

Building better datasets together

Recent Activity

data-is-better-together's activity

sayakpaul 
posted an update 1 day ago
davanstrien 
posted an update 5 days ago
Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu, the community is labelling the educational quality of texts in many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c
burtenshaw 
posted an update 6 days ago
People are flexing their end of year stats, so I made this app to show hub stats in a tidy design!

Thanks @Ameeeee and @jfcalvo for the feature from Argilla!
burtenshaw/recap
davidberenstein1957 
posted an update 7 days ago
sayakpaul 
posted an update 7 days ago
In the past seven days, the Diffusers team has shipped:

1. Two new video models
2. One new image model
3. Two new quantization backends
4. Three new fine-tuning scripts
5. Multiple fixes and library QoL improvements

Coffee on me if someone can guess 1 - 4 correctly.
nataliaElv 
posted an update 8 days ago
If you are still wondering how the FineWeb2 annotations are done, how to follow the guidelines or how Argilla works, this is your video!

I go through a few samples of the FineWeb2 dataset and classify them based on their educational content. Check it out!

https://www.youtube.com/watch?v=_-ORB4WAVGU
davidberenstein1957 
posted an update 9 days ago
Introducing the Synthetic Data Generator, a user-friendly application that takes a no-code approach to creating custom datasets with Large Language Models (LLMs). The best part: a simple step-by-step process makes dataset creation a non-technical breeze, so anyone can create datasets and models in minutes without writing any code.

Blog: https://huggingface.co/blog/synthetic-data-generator
Space: argilla/synthetic-data-generator
nataliaElv 
posted an update 14 days ago
How do your annotations for FineWeb2 compare to your teammates'?

I started contributing some annotations to the FineWeb2 collaborative annotation sprint and I wanted to know if my labelling trends were similar to those of my teammates.

I did some analysis and I wasn't surprised to see that I'm being a bit harsher in my evaluations than my mates 😂
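
For anyone curious what such a comparison can look like, here is a minimal sketch. It is not the code behind the Gradio space: the annotation records, the annotator names, and the 0-4 educational-quality score are all hypothetical, and the helper functions are illustrative only. It computes each annotator's mean score, the share of shared texts where two annotators agree exactly, and the average score gap on those shared texts.

```python
from collections import defaultdict

# Hypothetical records: (annotator, text_id, score).
# The 0-4 score scale is an assumption for illustration,
# not the actual FineWeb-C label set.
ANNOTATIONS = [
    ("me", "t1", 0), ("me", "t2", 1), ("me", "t3", 1), ("me", "t4", 2),
    ("teammate", "t1", 1), ("teammate", "t2", 1), ("teammate", "t3", 2),
    ("teammate", "t4", 2), ("teammate", "t5", 3),
]


def scores_by_annotator(annotations):
    """Group records into {annotator: {text_id: score}}."""
    grouped = defaultdict(dict)
    for annotator, text_id, score in annotations:
        grouped[annotator][text_id] = score
    return grouped


def mean_score(scores):
    """Average score over everything one annotator labelled."""
    return sum(scores.values()) / len(scores)


def agreement(scores_a, scores_b):
    """Fraction of shared texts with identical scores, or None if no overlap."""
    shared = scores_a.keys() & scores_b.keys()
    if not shared:
        return None
    return sum(scores_a[t] == scores_b[t] for t in shared) / len(shared)


def mean_gap(scores_a, scores_b):
    """Average of (a - b) on shared texts; negative means a scores lower."""
    shared = scores_a.keys() & scores_b.keys()
    if not shared:
        return None
    return sum(scores_a[t] - scores_b[t] for t in shared) / len(shared)


grouped = scores_by_annotator(ANNOTATIONS)
print("my mean score:", mean_score(grouped["me"]))
print("teammate mean score:", mean_score(grouped["teammate"]))
print("exact agreement:", agreement(grouped["me"], grouped["teammate"]))
print("mean gap (me - teammate):", mean_gap(grouped["me"], grouped["teammate"]))
```

A negative mean gap on the shared texts is what "being a bit harsher" would look like in numbers. Restricting the gap to shared texts keeps the comparison fair when annotators labelled different samples.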


Do you want to see how your annotations compare to others?
👉 Go to this Gradio space: nataliaElv/fineweb2_compare_my_annotations
✍️ Enter the dataset that you've contributed to and your Hugging Face username.

How were your results?
- Contribute some annotations: data-is-better-together/fineweb-c
- Join your language channel in Rocket chat: HuggingFaceFW/discussion
burtenshaw 
posted an update 15 days ago
Quick update from week 1 of smol course. The community is taking the driving seat and using the material for their own projects. If you want to do the same, join in!

- we have ongoing translation projects in Korean, Vietnamese, Portuguese, and Spanish.
- 3 chapters are ready for students, on topics like instruction tuning, preference alignment, and parameter-efficient fine-tuning.
- 3 chapters are in progress on evaluation, vision language models, and synthetic data.
- around 780 people have forked the repo to use it for learning, teaching, and sharing.

⏭️ The next step is to support people who want to use the course for teaching, content creation, internal knowledge sharing, or anything else. If you're into this, drop an issue or PR.

REPO: https://buff.ly/3ZCMKX2
discord channel: https://buff.ly/4f9F8jA
sayakpaul 
posted an update 16 days ago
Introducing a high-quality open-preference dataset to further this line of research for image generation.

Despite being such an integral component of modern image generation, open preference datasets are a rarity!

So, we decided to work on one with the community!

Check it out here:
https://huggingface.co/blog/image-preferences