Data Is Better Together

community

Activity Feed

AI & ML interests

Building better datasets together

Recent Activity

davanstrien updated a dataset 15 minutes ago

data-is-better-together/fineweb-c-progress

davanstrien new activity 1 day ago

data-is-better-together/fineweb-c-dashboard:add some links

davanstrien updated a Space 1 day ago

data-is-better-together/fineweb-c-dashboard

data-is-better-together's activity

davanstrien

updated a dataset 15 minutes ago

data-is-better-together/fineweb-c-progress

Viewer • Updated 15 minutes ago • 668 • 556 • 2

davanstrien

in data-is-better-together/fineweb-c-dashboard 1 day ago

add some links

#1 opened 1 day ago by

davanstrien

updated a Space 1 day ago

Running

🌐📊

FineWeb 2 - Community Leaderboard

sayakpaul

posted an update 1 day ago

Post

2326

Commits speak louder than words 🤪

* 4 new video models
* Multiple image models, including SANA & Flux Control
* New quantizers -> GGUF & TorchAO
* New training scripts

Enjoy this holiday-special Diffusers release 🤗
Notes: https://github.com/huggingface/diffusers/releases/tag/v0.32.0

davanstrien

in data-is-better-together/fineweb-c 2 days ago

fix rocket chat link

#3 opened 3 days ago by

davanstrien

updated a dataset 3 days ago

data-is-better-together/fineweb-c

Viewer • Updated 3 days ago • 25k • 283 • 21

davanstrien

updated a collection 4 days ago

FineWeb2 Collaborative Annotation Sprint

Collection

5 items • Updated 1 day ago • 6

davanstrien

in data-is-better-together/fineweb-c 5 days ago

<img src='https://i.pinimg.com/736x/ce/15/58/ce15584f4a9aaf701630a8902c6302c2.jpg'>

#1 opened 5 days ago by

usama121

<img src='https://i.pinimg.com/736x/ce/15/58/ce15584f4a9aaf701630a8902c6302c2.jpg'>

#2 opened 5 days ago by

usama121

davanstrien

posted an update 5 days ago

Post

1512

Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu, the community is labelling the educational quality of texts in many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c

davanstrien

in data-is-better-together/fineweb-communications-pack 5 days ago

Upload fineweb-c-card-header.png

#2 opened 5 days ago by

davanstrien

updated a Space 5 days ago

Running

🌐📢

FineWeb 2 Communications Pack

burtenshaw

posted an update 6 days ago

Post

2540

People are flexing their end-of-year stats, so I made this app to show Hub stats in a tidy design!

Thanks @Ameeeee and @jfcalvo for the feature from Argilla!
burtenshaw/recap

1 reply

davidberenstein1957

posted an update 7 days ago

Post

1264

🐇 Tumble down the AI rabbit hole without any technical knowledge!

Explore AI models on the Hub with a simple, quick search.

Demo: davidberenstein1957/transformers-pipeline-playground

sayakpaul

posted an update 7 days ago

Post

1550

In the past seven days, the Diffusers team has shipped:

1. Two new video models
2. One new image model
3. Two new quantization backends
4. Three new fine-tuning scripts
5. Multiple fixes and library QoL improvements

Coffee on me if someone can guess 1 - 4 correctly.

1 reply

nataliaElv

posted an update 8 days ago

Post

1598

If you are still wondering how the FineWeb2 annotations are done, how to follow the guidelines or how Argilla works, this is your video!

I go through a few samples of the FineWeb2 dataset and classify them based on their educational content. Check it out!

https://www.youtube.com/watch?v=_-ORB4WAVGU

davidberenstein1957

posted an update 9 days ago

Post

4106

Introducing the Synthetic Data Generator, a user-friendly application that takes a no-code approach to creating custom datasets with Large Language Models (LLMs). The best part: a simple step-by-step process makes dataset creation a non-technical breeze, so anyone can create datasets and models in minutes without writing any code.

Blog: https://huggingface.co/blog/synthetic-data-generator
Space: argilla/synthetic-data-generator

4 replies

nataliaElv

posted an update 14 days ago

Post

1244

How do your annotations for FineWeb2 compare to your teammates'?

I started contributing some annotations to the FineWeb2 collaborative annotation sprint and I wanted to know if my labelling trends were similar to those of my teammates.

I did some analysis and I wasn't surprised to see that I'm a bit harsher in my evaluations than my teammates 😂

Do you want to see how your annotations compare to others?
👉 Go to this Gradio space: nataliaElv/fineweb2_compare_my_annotations
✍️ Enter the dataset that you've contributed to and your Hugging Face username.

How were your results?
- Contribute some annotations: data-is-better-together/fineweb-c
- Join your language channel in Rocket.Chat: HuggingFaceFW/discussion

burtenshaw

posted an update 15 days ago

Post

2386

Quick update from week 1 of smol course. The community is taking the driving seat and using the material for their own projects. If you want to do the same, join in!

- We have ongoing translation projects in Korean, Vietnamese, Portuguese, and Spanish.
- 3 chapters are ready for students, on topics like instruction tuning, preference alignment, and parameter-efficient fine-tuning.
- 3 chapters are in progress, on evaluation, vision language models, and synthetic data.
- Around 780 people have forked the repo to use it for learning, teaching, and sharing.

⏭️ The next step is to support people who want to use the course for teaching, content creation, internal knowledge sharing, or anything else. If you're into this, drop an issue or PR.

REPO: https://buff.ly/3ZCMKX2
discord channel: https://buff.ly/4f9F8jA

sayakpaul

posted an update 16 days ago

Post

2040

Introducing a high-quality open-preference dataset to further this line of research for image generation.

Despite being such an inseparable component of modern image generation, open preference datasets are a rarity!

So, we decided to work on one with the community!

Check it out here:
https://huggingface.co/blog/image-preferences

7 replies

Team members 15