Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation Jun 20, 2024 • 12
Synthetic dataset generation techniques: generating custom sentence similarity data May 23, 2024 • 16
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20, 2024 • 74
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 29
Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub Aug 2, 2023 • 1
Qwen2.5-1M Collection The long-context version of Qwen2.5, supporting 1M-token context lengths • 2 items • Updated 1 day ago • 73
Explore, Curate and Vector Search Any Hugging Face Dataset with Nomic Atlas Article • By MaxNomic • 4 days ago • 29
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Paper • 2501.12948 • Published 5 days ago • 216
The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models Paper • 2501.09653 • Published 11 days ago • 12
AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages Paper • 2501.08284 • Published 13 days ago • 6
Train 400x faster Static Embedding Models with Sentence Transformers Article • 13 days ago • 124
OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training Paper • 2501.08197 • Published 13 days ago • 7
high-quality Chinese training datasets Collection A suite of high-quality Chinese datasets used for pretraining, fine-tuning, or preference alignment, along with the models trained on them. • 12 items • Updated 10 days ago • 9
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature Paper • 2501.07171 • Published 14 days ago • 49
HistBERTurk-Models Collection Fine-tuned BERTurk models for historical Turkish. • 3 items • Updated 22 days ago • 2
Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models Paper • 2501.04828 • Published 19 days ago • 11
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution Paper • 2501.05040 • Published 18 days ago • 15
BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations Paper • 2501.03403 • Published 21 days ago • 4
Synthetic Data Generation with FastData and Hugging Face Article • By asoria • 20 days ago • 14
Crowd-sourced Open Preference Dataset for Text-to-Image Generation Article • By RapidataAI • 20 days ago • 18
CaseSumm: A Large-Scale Dataset for Long-Context Summarization from U.S. Supreme Court Opinions Paper • 2501.00097 • Published 28 days ago • 1