Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

AI & ML interests

Machine Learning Librarian

Recent Activity

updated a dataset 36 minutes ago

data-is-better-together/fineweb-c-progress

reacted to fdaudens's post with ❤️ about 2 hours ago

Yes, DeepSeek R1's release is impressive. But the real story is what happened in just 7 days after:

- Original release: 8 models, 540K downloads. Just the beginning...
- The community turned those open-weight models into +550 NEW models on Hugging Face. Total downloads? 2.5M, nearly 5X the originals.

The reason? DeepSeek models are open-weight, letting anyone build on top of them. Interesting to note that the community focused on quantized versions for better efficiency & accessibility. They want models that use less memory, run faster, and are more energy-efficient.

When you empower builders, innovation explodes. For everyone. 🚀

The most popular community model? @bartowski's DeepSeek-R1-Distill-Qwen-32B-GGUF version, with 1M downloads alone.

liked a model about 2 hours ago

deepseek-ai/Janus-Pro-7B

Articles

Explore, Curate and Vector Search Any Hugging Face Dataset with Nomic Atlas

FineWeb2-C: Help Build Better Language Models in Your Language

Open Preference Dataset for Text-to-Image Generation by the 🤗 Community

Let’s make a generation of amazing image generation models

Share your open ML datasets on Hugging Face Hub!

Scaling AI-based Data Processing with Hugging Face + Dask

Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation

Data Is Better Together: A Look Back and Forward

Synthetic dataset generation techniques: generating custom sentence similarity data

Synthetic dataset generation techniques: Self-Instruct

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Data is better together

Extracting Insights from Model Cards Using Open Large Language Models

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Introducing BERTopic Integration with Hugging Face Hub

Jupyter X Hugging Face

Image search with 🤗 datasets

davanstrien's activity

upvoted a collection 1 day ago

Qwen2.5-1M

The long-context version of Qwen2.5, supporting 1M-token context lengths • 2 items • Updated 1 day ago • 73

upvoted an article 4 days ago

Article

Explore, Curate and Vector Search Any Hugging Face Dataset with Nomic Atlas

4 days ago • 29

upvoted a paper 4 days ago

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Paper • 2501.12948 • Published 5 days ago • 216

upvoted an article 6 days ago

Article

Exploring Synthetic Data Generation with DataDreamer

6 days ago • 6

upvoted a paper 10 days ago

The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models

Paper • 2501.09653 • Published 11 days ago • 12

upvoted a paper 11 days ago

AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages

Paper • 2501.08284 • Published 13 days ago • 6

upvoted an article 12 days ago

Article

Train 400x faster Static Embedding Models with Sentence Transformers

13 days ago • 124

upvoted a paper 12 days ago

OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training

Paper • 2501.08197 • Published 13 days ago • 7

upvoted a collection 12 days ago

high-quality Chinese training datasets

a suite of high-quality Chinese datasets, used for pretraining, fine-tuning or preference alignment. And the models trained on these datasets. • 12 items • Updated 10 days ago • 9

upvoted a paper 13 days ago

BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

Paper • 2501.07171 • Published 14 days ago • 49

upvoted a collection 18 days ago

HistBERTurk-Models

Fine-tuned BERTurk models for historical Turkish. • 3 items • Updated 22 days ago • 2

upvoted 3 papers 18 days ago

Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models

Paper • 2501.04828 • Published 19 days ago • 11

SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution

Paper • 2501.05040 • Published 18 days ago • 15

BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations

Paper • 2501.03403 • Published 21 days ago • 4

upvoted 2 articles 20 days ago

Article

Synthetic Data Generation with FastData and Hugging Face

20 days ago • 14

Article

Crowd-sourced Open Preference Dataset for Text-to-Image Generation

20 days ago • 18

upvoted a collection 21 days ago

METAGENE-1

METAGENE-1 Models • 5 items • Updated 20 days ago • 5

upvoted a paper 21 days ago

CaseSumm: A Large-Scale Dataset for Long-Context Summarization from U.S. Supreme Court Opinions

Paper • 2501.00097 • Published 28 days ago • 1

upvoted 2 collections about 1 month ago

🥂 FineWeb2

3 items • Updated Dec 8, 2024 • 12

QVQ

QVQ: Qwen models for visual reasoning • 7 items • Updated 26 days ago • 40