Leaderboards - a kaizuberbuehler Collection

Leaderboards

updated 13 days ago

Running

183

BigCodeBench Leaderboard

🥇

Explore and analyze code evaluation data
Running

703

UGI Leaderboard

📢

Display a leaderboard with UGI scores
Running

4.14k

Chatbot Arena Leaderboard

🏆

Display chatbot leaderboard and statistics
Running on CPU Upgrade

5.01k

MTEB Leaderboard

🥇

Select benchmarks and languages for text embeddings evaluation
Running on CPU Upgrade

12.7k

Open LLM Leaderboard

🏆

Track, rank and evaluate open LLMs and chatbots
Running

1.18k

Big Code Models Leaderboard

📈

Submit code models for evaluation on benchmarks
Running

93

OpenCompass LLM Leaderboard

🚀

Display a web page
Running on CPU Upgrade

662

Open ASR Leaderboard

🏆

Request evaluation of a speech recognition model
Running

362

Text To Image Leaderboard

📊

Generate images from text descriptions
Running

289

LLM Performance Leaderboard

🐨

View LLM Performance Leaderboard
WebArena: A Realistic Web Environment for Building Autonomous Agents

Paper • 2307.13854 • Published Jul 25, 2023 • 25
Running

84

Zebra Logic Bench

🦓

Display and explore zebra puzzle leaderboard
Runtime error

22

LiveBench

🥇
Running

85

imgsys.org

📊

imgsys.org -- arena for text guided image generation
Running

49

ZeroEval Leaderboard

📊

Embed and use ZeroEval for evaluation tasks
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

Paper • 2408.14354 • Published Aug 26, 2024 • 41
Configuration error

53

Hallucination Evaluation Leaderboard

⚡
Running on CPU Upgrade

111

HHEM Leaderboard

🥇

Explore and submit LLM benchmark evaluations
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Paper • 2404.07972 • Published Apr 11, 2024 • 48
A3: Android Agent Arena for Mobile GUI Agents

Paper • 2501.01149 • Published Jan 2 • 22
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Paper • 2412.14161 • Published Dec 18, 2024 • 51
Running on Zero

311

TTS Spaces Arena

🤗

Blind vote on HF TTS models!
MiniMax-01: Scaling Foundation Models with Lightning Attention

Paper • 2501.08313 • Published Jan 14 • 274
Running

14

BrowserGym Leaderboard

🏆

Display data interactively
Running on CPU Upgrade

29

DABstep Leaderboard

🕺

DABstep Reasoning Benchmark Leaderboard
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles

Paper • 2502.01081 • Published Feb 3 • 14
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance

Paper • 2502.08127 • Published 25 days ago • 50
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

Paper • 2502.09696 • Published 24 days ago • 38
Running on CPU Upgrade

233

Agent Leaderboard

💬

Ranking of LLMs for agentic tasks