Open LLM Leaderboard: Track, rank and evaluate open LLMs and chatbots
WebArena: A Realistic Web Environment for Building Autonomous Agents • Paper 2307.13854 • Published Jul 25, 2023
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java • Paper 2408.14354 • Published Aug 26
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments • Paper 2404.07972 • Published Apr 11