Open LLM Leaderboard
Track, rank and evaluate open LLMs and chatbots
Gathering benchmark spaces on the hub (beyond the Open LLM Leaderboard)
Note The 🤗 Open LLM Leaderboard aims to track, rank and evaluate open LLMs and chatbots. Submit a model for automated evaluation on the 🤗 GPU cluster on the "Submit" page!
Note Massive Text Embedding Benchmark (MTEB) Leaderboard.
Note This leaderboard is based on the following three benchmarks: Chatbot Arena - a crowdsourced, randomized battle platform. We use 70K+ user votes to compute Elo ratings. MT-Bench - a set of challenging multi-turn questions. We use GPT-4 to grade the model responses. MMLU (5-shot) - a test to measure a model's multitask accuracy on 57 tasks.
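For context, below is a minimal sketch of the standard Elo update rule for pairwise battles. It assumes a fixed K-factor of 32 and sequential updates over votes, purely to illustrate the idea; the Arena's actual computation over its full vote set may differ.

```python
# Minimal sketch of the standard online Elo update for pairwise model battles.
# Illustration only: fixed K-factor, sequential updates over a list of votes.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    """outcome = 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome - e_a), r_b + k * ((1 - outcome) - (1 - e_a))

# Example: replay a small list of (model_a, model_b, outcome) votes.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
for a, b, outcome in [("model-a", "model-b", 1.0), ("model-a", "model-b", 0.5)]:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], outcome)
print(ratings)
```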
Note The 🤗 LLM-Perf Leaderboard aims to benchmark the performance (latency, throughput & memory) of Large Language Models (LLMs) across different hardware, backends and optimizations using Optimum-Benchmark and Optimum flavors. Anyone from the community can request a model or a hardware/backend/optimization configuration for automated benchmarking.
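As a rough illustration of what latency and throughput mean here, the sketch below times a single greedy generation run with transformers. This is not the Optimum-Benchmark methodology the leaderboard actually uses, and the model name is just a stand-in.

```python
# Illustrative sketch only: rough latency and tokens/s for one generation run.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # stand-in model; any causal LM on the Hub works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"latency: {elapsed:.2f} s, throughput: {new_tokens / elapsed:.1f} tokens/s")
```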
Note Compare the performance of base multilingual code generation models on the HumanEval benchmark and MultiPL-E. We also measure throughput and provide information about the models. We only compare open pre-trained multilingual code models that people can start from as base models for their own training.
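HumanEval-style benchmarks are usually scored with pass@k. The sketch below implements the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); the sample counts in the example are chosen purely for illustration.

```python
# Unbiased pass@k estimator: n = samples generated per problem,
# c = samples that pass the unit tests, k = budget being scored.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate of the probability that at least one of k samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 passing -> estimate pass@1 and pass@10.
print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 10))
```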
Note The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models on the Hugging Face Hub. We report the Average WER (⬇️) and RTF (⬇️) - the lower the better. Models are ranked based on their Average WER, from lowest to highest.
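For reference, the sketch below computes the two reported metrics from first principles: WER as word-level edit distance normalized by the number of reference words, and RTF as processing time divided by audio duration. It is a plain-Python illustration, not the leaderboard's evaluation code.

```python
# WER: word-level edit distance / reference length. RTF: processing time / audio duration.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    return processing_seconds / audio_seconds

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
print(real_time_factor(2.0, 10.0))  # 0.2
```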
Note The MT-Bench Browser (see Chatbot Arena).
Open Persian LLM Leaderboard