Benchmark and Evaluation - a YedsonUQ Collection

updated about 18 hours ago

Humanity's Last Exam
Paper • 2501.14249 • Published Jan 24 • 71

Benchmarking LLMs for Political Science: A United Nations Perspective
Paper • 2502.14122 • Published Feb 19 • 2

IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval
Paper • 2503.04644 • Published 28 days ago • 20

ExpertGenQA: Open-ended QA generation in Specialized Domains
Paper • 2503.02948 • Published about 1 month ago

Toward Stable and Consistent Evaluation Results: A New Methodology for Base Model Evaluation
Paper • 2503.00812 • Published Mar 2

Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content
Paper • 2503.16031 • Published 15 days ago • 3

JudgeLRM: Large Reasoning Models as a Judge
Paper • 2504.00050 • Published 4 days ago • 43