MMTEB Collection Our contribution to the Massive Multilingual Text Embedding Benchmark (MMTEB). Retrieval and reranking benchmarks in 16 languages. • 4 items • Updated Jun 6, 2024 • 2
CommonCrawl Collection Large web-mined general corpus based on CommonCrawl. • 7 items • Updated Dec 8, 2024 • 2
NoLiMa: Long-Context Evaluation Beyond Literal Matching Paper • 2502.05167 • Published about 1 month ago • 15
Article Finding Moroccan Arabic (Darija) in Fineweb 2 By omarkamali and 3 others • Dec 8, 2024 • 22
OpenCoder Collection OpenCoder is an open and reproducible code LLM family that matches the performance of top-tier code LLMs. • 8 items • Updated Nov 23, 2024 • 80
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages Paper • 2410.23825 • Published Oct 31, 2024 • 4
LLM Reasoning Papers Collection Papers to improve reasoning capabilities of LLMs • 20 items • Updated Jan 15 • 120
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment Paper • 2410.05873 • Published Oct 8, 2024 • 3