Benchmarking LLMs for Political Science: A United Nations Perspective Paper • 2502.14122 • Published Feb 19
IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval Paper • 2503.04644 • Published 28 days ago
ExpertGenQA: Open-ended QA generation in Specialized Domains Paper • 2503.02948 • Published about 1 month ago
Toward Stable and Consistent Evaluation Results: A New Methodology for Base Model Evaluation Paper • 2503.00812 • Published Mar 2
Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content Paper • 2503.16031 • Published 15 days ago