Daniel van Strien's picture

Daniel van Strien PRO

davanstrien

AI & ML interests

Machine Learning Librarian

Recent Activity

updated a dataset about 5 hours ago
librarian-bots/dataset_cards_with_metadata
updated a dataset about 6 hours ago
data-is-better-together/fineweb-c-progress
updated a model about 15 hours ago
knowledgator/modern-gliner-bi-large-v1.0
View all activity

Articles

Organizations

Hugging Face's profile picture Notebooks-explorers's profile picture Living with Machines's profile picture BigScience Workshop's profile picture Spaces-explorers's profile picture BigScience Catalogue Data's profile picture Hacks/Hackers's profile picture flyswot's profile picture BigScience: LMs for Historical Texts's profile picture Cohere For AI's profile picture Webhooks Explorers (BETA)'s profile picture HuggingFaceM4's profile picture Open Access AI Collective's profile picture HF Canonical Model Maintainers's profile picture BigLAM: BigScience Libraries, Archives and Museums's profile picture Hugging Face OSS Metrics's profile picture ImageIN's profile picture Stable Diffusion Bias Eval's profile picture Librarian Bots's profile picture Blog-explorers's profile picture Hacktoberfest 2023's profile picture Hugging Face TB Research's profile picture geospatial's profile picture HF-IA-archiving's profile picture 2A2I Legacy Models & Datasets's profile picture testy's profile picture DIBT-for-Klingon's profile picture Wikimedia Movement's profile picture DIBT-for-Esperanto's profile picture Journalists on Hugging Face's profile picture PleIAs's profile picture Argilla Explorers's profile picture Persian AI Community's profile picture Data Is Better Together's profile picture Social Post Explorers's profile picture OMOTO AI's profile picture academic-datasets's profile picture HuggingFaceFW-Dev's profile picture Hugging Face Discord Community's profile picture UCSF-JHU Opioid Industry Documents Archive's profile picture Dataset Tools's profile picture PDFPages's profile picture dibt-private's profile picture Data Is Better Together Contributor's profile picture Bluesky Community's profile picture

Posts 38

view post
Post
1502
Introducing FineWeb-C πŸŒπŸŽ“, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu the community is labelling the educational quality of texts for many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c
view post
Post
485
Increasingly, LLMs are becoming very useful for helping scale annotation tasks, i.e. labelling and filtering. When combined with the structured generation, this can be a very scalable way of doing some pre-annotation without requiring a large team of human annotators.

However, there are quite a few cases where it still doesn't work well. This is a nice paper looking at the limitations of LLM as an annotator for Low Resource Languages: On Limitations of LLM as Annotator for Low Resource Languages (2411.17637).

Humans will still have an important role in the loop to help improve models for all languages (and domains).