Multilingual LLM Evaluation
Multilingual Evaluation Benchmarks
CohereForAI/Global-MMLU
Global-MMLU is a multilingual evaluation set of exams in 42 languages, including English. It greatly improves the multilingual coverage and quality of the English MMLU through professional translations and crowd-sourced post-edits. It also includes cultural sensitivity annotations, classifying samples as Culturally Sensitive (CS) or Culturally Agnostic (CA).
CohereForAI/Global-MMLU-Lite
Global-MMLU-Lite is a multilingual evaluation set spanning 15 languages, including English. It is a "lite" version of the original Global-MMLU dataset: its samples correspond to the languages that are fully human-translated or post-edited in the original Global-MMLU dataset.
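As a rough, hedged illustration of how these two sets might be pulled into an evaluation loop, the sketch below loads one language subset with the datasets library and separates culturally sensitive from culturally agnostic samples. The per-language config name, the split name, and the annotation column name (cultural_sensitivity_label) are assumptions based on the descriptions above, not confirmed schema.

```python
from datasets import load_dataset

# Minimal sketch: load one language subset of Global-MMLU and split it by the
# cultural-sensitivity annotation described above. The config name, split name,
# and column "cultural_sensitivity_label" are assumptions; check the dataset
# viewer for the real schema.
LANG = "en"  # any of the 42 language codes

global_mmlu = load_dataset("CohereForAI/Global-MMLU", LANG, split="test")

culturally_sensitive = global_mmlu.filter(
    lambda row: row.get("cultural_sensitivity_label") == "CS"
)
culturally_agnostic = global_mmlu.filter(
    lambda row: row.get("cultural_sensitivity_label") == "CA"
)
print(len(culturally_sensitive), len(culturally_agnostic))

# The lite set (15 fully human-translated or post-edited languages) can be
# loaded the same way, assuming it exposes the same per-language configs:
global_mmlu_lite = load_dataset("CohereForAI/Global-MMLU-Lite", LANG, split="test")
```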
CohereForAI/m-ArenaHard
The m-ArenaHard dataset is an extremely challenging multilingual LLM evaluation set for measuring the quality of open-ended generations. It was created by translating the prompts of the originally English-only LMArena (formerly LMSYS) arena-hard-auto-v0.1 test dataset into 22 languages using the Google Translate API v3. For each language, there are 500 challenging user queries sourced from Chatbot Arena.
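A minimal sketch of how one language's prompts could be consumed for pairwise LLM-as-a-judge evaluation is below. The config name ("de") and the prompt column ("prompt") are assumptions about the card's layout, and generate is a placeholder for the model under evaluation.

```python
from datasets import load_dataset

# Minimal sketch: fetch one language's 500 queries from m-ArenaHard and
# generate candidate answers for pairwise comparison against a baseline.
# The config name ("de") and the "prompt" column are assumptions; check the card.
m_arena_hard = load_dataset("CohereForAI/m-ArenaHard", "de")
prompts = next(iter(m_arena_hard.values()))  # whichever split the card exposes

def generate(prompt: str) -> str:
    """Placeholder for the model under evaluation."""
    raise NotImplementedError

answers = [generate(row["prompt"]) for row in prompts]
```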
CohereForAI/include-base-44
INCLUDE is a comprehensive collection of in-language exams across 44 languages that evaluates multilingual LLMs in the actual language environments where they would be deployed. It contains 22,637 four-option multiple-choice questions (MCQs) extracted from academic and professional exams, covering 57 topics, including regional knowledge.
CohereForAI/include-lite-44
INCLUDE is a comprehensive knowledge- and reasoning-centric benchmark across 44 languages that evaluates multilingual LLMs in the actual language environments where they would be deployed. For a quicker evaluation, you can use include-lite-44, a subset of include-base-44 covering the same 44 languages.
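The sketch below shows one way the INCLUDE MCQs might be formatted into prompts and scored for exact-match accuracy; the same code would apply to include-lite-44 by swapping the repo name. The config/split layout and the column names ("question", "choices", and "answer" as an integer index) are assumptions, not confirmed schema.

```python
from datasets import load_dataset

# Minimal sketch: format an INCLUDE row as a 4-option MCQ prompt and score
# exact-match accuracy. Config/split layout and the columns "question",
# "choices", "answer" (integer index) are assumptions; check the dataset viewer.
include = load_dataset("CohereForAI/include-base-44", split="test")
# For a quicker run, swap in "CohereForAI/include-lite-44".

LETTERS = "ABCD"

def to_prompt(row: dict) -> str:
    options = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(row["choices"]))
    return f"{row['question']}\n{options}\nAnswer:"

def accuracy(predictions: list[str], rows) -> float:
    correct = sum(
        pred.strip().upper().startswith(LETTERS[row["answer"]])
        for pred, row in zip(predictions, rows)
    )
    return correct / len(rows)
```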
CohereForAI/aya_redteaming
The Aya Red-teaming dataset is a human-annotated multilingual red-teaming dataset consisting of harmful prompts in 8 languages across 9 categories of harm, with explicit labels for "global" and "local" harm.
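A hedged sketch of how the safety prompts could be grouped by harm category and by the global/local label is below. The config name ("english"), split name, and column names ("harm_category", "global_or_local") are assumptions; the actual card may organize languages and labels differently.

```python
from collections import Counter
from datasets import load_dataset

# Minimal sketch: count red-teaming prompts per harm category and per
# "global" vs "local" harm label for one language. The config ("english"),
# split, and columns "harm_category"/"global_or_local" are assumptions.
redteam = load_dataset("CohereForAI/aya_redteaming", "english", split="test")

counts = Counter(
    (str(row["harm_category"]), row["global_or_local"]) for row in redteam
)
for (category, scope), n in counts.most_common():
    print(f"{category:40s} {scope:8s} {n}")
```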
CohereForAI/aya_evaluation_suite
Aya Evaluation Suite contains open-ended, conversation-style prompts for evaluating multilingual open-ended generation quality. To strike a balance between language coverage and the quality that comes with human curation, the suite covers 101 languages for evaluating the conversational abilities of language models.
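As a rough illustration, the sketch below groups the open-ended prompts by language so they can be fed to a model for side-by-side or LLM-judge evaluation. The subset name ("aya_human_annotated"), the split, and the columns ("language", "inputs") are assumptions; check the dataset viewer for the real layout.

```python
from datasets import load_dataset

# Minimal sketch: group open-ended prompts by language for downstream
# generation and judging. The subset "aya_human_annotated", the split, and the
# columns "language"/"inputs" are assumptions; verify against the card.
suite = load_dataset(
    "CohereForAI/aya_evaluation_suite", "aya_human_annotated", split="test"
)

prompts_by_language = {}
for row in suite:
    prompts_by_language.setdefault(row["language"], []).append(row["inputs"])

for language, prompts in sorted(prompts_by_language.items()):
    print(language, len(prompts))
```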
C4AI-Community/multilingual-reward-bench
M-RewardBench is a reward-model evaluation benchmark covering 23 typologically diverse languages. It contains prompt-chosen-rejected preference triples obtained by curating and translating chat, safety, and reasoning instances.
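The sketch below shows the standard way such preference triples are typically used: scoring how often a reward model ranks the chosen completion above the rejected one. The config name ("deu_Latn"), split, and columns ("prompt", "chosen", "rejected") are assumptions, and reward is a placeholder for your reward model's scoring call.

```python
from datasets import load_dataset

# Minimal sketch: reward-model accuracy on one language of M-RewardBench,
# i.e. how often the model scores the chosen completion above the rejected one.
# The config ("deu_Latn"), split, and columns "prompt"/"chosen"/"rejected"
# are assumptions; check the dataset viewer for the real schema.
bench = load_dataset("C4AI-Community/multilingual-reward-bench", "deu_Latn", split="test")

def reward(prompt: str, completion: str) -> float:
    """Placeholder for the reward model's scoring call."""
    raise NotImplementedError

wins = sum(
    reward(row["prompt"], row["chosen"]) > reward(row["prompt"], row["rejected"])
    for row in bench
)
print(f"accuracy: {wins / len(bench):.3f}")
```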