From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Abstract
Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). However, traditional methods, whether matching-based or embedding-based, often fall short in judging subtle attributes and delivering satisfactory results. Recent advances in Large Language Models (LLMs) have inspired the "LLM-as-a-judge" paradigm, in which LLMs are leveraged to perform scoring, ranking, or selection across various tasks and applications. This paper provides a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to advance this emerging field. We begin with detailed definitions from both the input and the output perspectives. We then introduce a comprehensive taxonomy that explores LLM-as-a-judge along three dimensions: what to judge, how to judge, and where to judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and highlight key challenges and promising directions, aiming to provide valuable insights and inspire future research in this emerging area. The paper list and additional resources on LLM-as-a-judge can be found at https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge and https://llm-as-a-judge.github.io.
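To make the paradigm concrete, the sketch below shows one common instantiation described in the abstract, pairwise selection: a judge model receives a question and two candidate responses and returns a verdict. The prompt wording and the `call_llm` wrapper are illustrative assumptions, not components specified in the paper.

```python
# Minimal sketch of the LLM-as-a-judge pattern for pairwise selection.
# `call_llm` is a hypothetical stand-in for any chat-completion client;
# the prompt wording and output format below are illustrative only.

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
question below and decide which one is better.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Reply with exactly one token: "A", "B", or "Tie"."""


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API (plug in your preferred client)."""
    raise NotImplementedError


def pairwise_judge(question: str, response_a: str, response_b: str) -> str:
    """Ask the judge model to select the better response (or declare a tie)."""
    prompt = JUDGE_PROMPT.format(
        question=question, response_a=response_a, response_b=response_b
    )
    verdict = call_llm(prompt).strip()
    # Fall back to "Tie" if the judge's output cannot be parsed.
    return verdict if verdict in {"A", "B", "Tie"} else "Tie"
```

The same skeleton covers the other output modes surveyed in the paper by changing the instruction: asking for a numeric score yields pointwise scoring, and asking for an ordering over several candidates yields ranking.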
Community
More resources on LLM-as-a-judge are available on the project website: https://llm-as-a-judge.github.io
We also release a curated paper list on LLM-as-a-judge at: https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- RevisEval: Improving LLM-as-a-Judge via Response-Adapted References (2024)
- JudgeBench: A Benchmark for Evaluating LLM-based Judges (2024)
- Self-rationalization improves LLM as a fine-grained judge (2024)
- ReIFE: Re-evaluating Instruction-Following Evaluation (2024)
- HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly (2024)
- Unveiling Context-Aware Criteria in Self-Assessing LLMs (2024)
- Large Language Models Are Active Critics in NLG Evaluation (2024)