Datasets - a igormolybog Collection

igormolybog 's Collections

Domain spec fine-tuning

Inference speed

llama + WebWork

evals

Solver training

Hetero training

Open

Agents

Imagen

Datasets

updated May 8

Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models

Paper • 2311.06783 • Published Nov 12, 2023 • 26
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning

Paper • 2311.07574 • Published Nov 13, 2023 • 14
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding

Paper • 2401.04575 • Published Jan 9 • 14
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Paper • 2402.00159 • Published Jan 31 • 59
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

Paper • 2402.06619 • Published Feb 9 • 54
AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts

Paper • 2402.07625 • Published Feb 12 • 11
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset

Paper • 2402.10176 • Published Feb 15 • 35
StarCoder 2 and The Stack v2: The Next Generation

Paper • 2402.19173 • Published Feb 29 • 136
WildChat: 1M ChatGPT Interaction Logs in the Wild

Paper • 2405.01470 • Published May 2 • 59
NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment

Paper • 2405.01481 • Published May 2 • 25