Aya Datasets

The Aya Collection is a massive multilingual collection covering over 100 languages and consisting of 513 million instances of prompts and completions.

CohereForAI/aya_collection

Viewer • Updated Jun 28, 2024 • 514M • 25.9k • 220
Note: The Aya Collection is a massive multilingual collection of 513 million prompt-completion instances covering a wide range of tasks. It applies instruction-style templates written by fluent speakers to a curated list of datasets, and it also includes translations of instruction-style datasets into 101 languages.
CohereForAI/aya_dataset

Viewer • Updated Jun 28, 2024 • 206k • 2.34k • 299
Note: The Aya Dataset is a multilingual instruction fine-tuning dataset curated by an open-science community through the Aya Annotation Platform from Cohere For AI. It contains 204k human-annotated prompt-completion pairs, along with demographic data about the annotators.
CohereForAI/aya_evaluation_suite

Viewer • Updated Jun 28, 2024 • 26.8k • 2.05k • 49
Note: The Aya Evaluation Suite contains open-ended, conversation-style prompts for evaluating multilingual open-ended generation quality. To strike a balance between language coverage and the quality that comes with human curation, we create an evaluation suite that covers 101 languages for evaluating the conversational abilities of language models.
CohereForAI/aya_collection_language_split

Viewer • Updated Jun 28, 2024 • 514M • 33.1k • 95
Note: This dataset has the same content as the original aya_collection; only the structure of the upload differs. The original aya_collection is organized into folders by dataset name, whereas this version is split by language. We recommend this version if you only want to download the Aya Collection for a single language or a small set of languages.
CohereForAI/aya_redteaming

Viewer • Updated Jun 28, 2024 • 7.42k • 427 • 21
Note: The Aya Red-teaming dataset is a human-annotated multilingual red-teaming dataset consisting of harmful prompts in 8 languages across 9 different categories of harm, with explicit labels for "global" and "local" harm.
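
The note on aya_collection_language_split contrasts two layouts: one organized by dataset name and one split by language. The following is a minimal sketch of what that difference could mean in practice, not an official loading recipe. It assumes the Hugging Face `datasets` library is installed, that each language is exposed as a configuration whose exact name must be taken from the dataset card, that a `train` split exists, and that rows carry a `language` field. The helper names `keep_languages` and `stream_language_split` are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Set


def keep_languages(
    rows: Iterable[Dict[str, Any]],
    wanted: Set[str],
    key: str = "language",  # assumed field name; check the dataset card
) -> Iterator[Dict[str, Any]]:
    """Yield only the rows whose `key` field is in `wanted`.

    This is the kind of scan needed when data is organized by dataset
    name and every row has to be inspected to find one language.
    """
    for row in rows:
        if row.get(key) in wanted:
            yield row


def stream_language_split(config_name: str, split: str = "train"):
    """Stream one configuration of the language-split upload.

    Assumptions: `config_name` is a configuration listed on the dataset
    card, and `split` exists for it. Streaming avoids downloading the
    full collection up front.
    """
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "CohereForAI/aya_collection_language_split",
        config_name,
        split=split,
        streaming=True,
    )
```

With the language-split upload, a call such as `stream_language_split(name)` (with `name` copied from the dataset card) requests only the data for that language, whereas `keep_languages` has to read every row it is given before discarding the ones that do not match.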