GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation Paper • 2411.18499 • Published Nov 27, 2024 • 18
VLSBench: Unveiling Visual Leakage in Multimodal Safety Paper • 2411.19939 • Published Nov 29, 2024 • 10
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? Paper • 2412.02611 • Published Dec 3, 2024 • 24
U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs Paper • 2412.03205 • Published Dec 4, 2024 • 16
ProcessBench: Identifying Process Errors in Mathematical Reasoning Paper • 2412.06559 • Published Dec 9, 2024 • 83
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations Paper • 2412.07626 • Published Dec 10, 2024 • 22
VisionArena: 230K Real World User-VLM Conversations with Preference Labels Paper • 2412.08687 • Published Dec 11, 2024 • 13
SCBench: A KV Cache-Centric Analysis of Long-Context Methods Paper • 2412.10319 • Published Dec 13, 2024 • 10
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models Paper • 2412.12606 • Published Dec 17, 2024 • 42
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks Paper • 2412.14161 • Published Dec 18, 2024 • 52
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment Paper • 2412.13746 • Published Dec 18, 2024 • 9
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings Paper • 2501.01257 • Published Jan 2, 2025 • 53
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models Paper • 2501.02955 • Published Jan 6, 2025 • 45
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? Paper • 2501.05510 • Published Jan 9, 2025 • 44
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them Paper • 2501.08292 • Published Jan 14, 2025 • 17
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents Paper • 2501.08828 • Published Jan 15, 2025 • 32
ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario Paper • 2501.10132 • Published Jan 17, 2025 • 20
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published Jan 21, 2025 • 86
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models Paper • 2502.00698 • Published Feb 2, 2025 • 24
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models Paper • 2502.07346 • Published Feb 11, 2025 • 53
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation Paper • 2502.08168 • Published Feb 12, 2025 • 12
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency Paper • 2502.09621 • Published Feb 13, 2025 • 27
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models Paper • 2502.09696 • Published Feb 13, 2025 • 43
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? Paper • 2502.12115 • Published Feb 17, 2025 • 45
The Mirage of Model Editing: Revisiting Evaluation in the Wild Paper • 2502.11177 • Published Feb 16, 2025 • 10
MLGym: A New Framework and Benchmark for Advancing AI Research Agents Paper • 2502.14499 • Published Feb 20, 2025 • 190
StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following Paper • 2502.14494 • Published Feb 20, 2025 • 14
Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content Paper • 2503.02357 • Published Mar 4, 2025 • 7
Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection Paper • 2503.01449 • Published Mar 3, 2025 • 4
Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol Paper • 2503.05860 • Published Mar 7, 2025 • 9
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM Paper • 2503.14478 • Published Mar 18, 2025 • 44
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding Paper • 2503.12797 • Published Mar 17, 2025 • 29
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification Paper • 2503.12505 • Published Mar 16, 2025 • 9
PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models Paper • 2503.12545 • Published Mar 16, 2025 • 5
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? Paper • 2503.19990 • Published Mar 25, 2025 • 33
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness Paper • 2503.21755 • Published Mar 27, 2025 • 31
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing Paper • 2504.02826 • Published Apr 3, 2025 • 67
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning Paper • 2504.07956 • Published Apr 10, 2025 • 43
S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models Paper • 2504.10368 • Published Apr 14, 2025 • 19
LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models Paper • 2504.10415 • Published Apr 14, 2025 • 7
ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness Paper • 2504.10514 • Published Apr 14, 2025 • 43