CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
Abstract
Efficient and accurate evaluation is crucial for the continuous improvement of large language models (LLMs). Among various assessment methods, subjective evaluation has garnered significant attention due to its superior alignment with real-world usage scenarios and human preferences. However, human-based evaluations are costly and lack reproducibility, making precise automated evaluators (judgers) vital in this process. In this report, we introduce CompassJudger-1, the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: 1. Performing unitary scoring and two-model comparisons as a reward model; 2. Conducting evaluations according to specified formats; 3. Generating critiques; 4. Executing diverse tasks like a general LLM. To assess the evaluation capabilities of different judge models under a unified setting, we have also established JudgerBench, a new benchmark that encompasses various subjective evaluation tasks and covers a wide range of topics. CompassJudger-1 offers a comprehensive solution for various evaluation tasks while maintaining the flexibility to adapt to diverse requirements. Both CompassJudger and JudgerBench are released and available to the research community at https://github.com/open-compass/CompassJudger. We believe that by open-sourcing these tools, we can foster collaboration and accelerate progress in LLM evaluation methodologies.
Community
- https://huggingface.co./opencompass/CompassJudger-1-1.5B-Instruct
- https://huggingface.co./opencompass/CompassJudger-1-7B-Instruct
- https://huggingface.co./opencompass/CompassJudger-1-14B-Instruct
- https://huggingface.co./opencompass/CompassJudger-1-32B-Instruct
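Since the checkpoints above are standard instruction-tuned causal LMs on Hugging Face, they can be loaded with `transformers` and prompted as a pairwise judge. The sketch below is illustrative only: the judge prompt format is an assumption, not the paper's official template (see the CompassJudger repository for the recommended prompts and evaluation scripts).

```python
# Minimal sketch: using CompassJudger-1 as a pairwise judge via Hugging Face
# transformers. The prompt wording below is an assumption for illustration,
# not the official CompassJudger template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "opencompass/CompassJudger-1-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype="auto"
)

question = "Explain the difference between a list and a tuple in Python."
answer_a = "Lists are mutable sequences; tuples are immutable sequences."
answer_b = "They are the same thing with different names."

# Illustrative pairwise-comparison prompt (hypothetical format).
judge_prompt = (
    "You are an impartial judge. Compare the two responses to the question "
    "and state which one is better, with a brief critique.\n\n"
    f"[Question]\n{question}\n\n"
    f"[Response A]\n{answer_a}\n\n"
    f"[Response B]\n{answer_b}\n\n"
    "Verdict (A or B) and critique:"
)

messages = [{"role": "user", "content": judge_prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding keeps the verdict deterministic across runs.
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

The same loading pattern applies to the 1.5B, 14B, and 32B checkpoints; only `model_name` changes. For single-response scoring or critique generation, the prompt would be adapted accordingly.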
Related papers recommended by the Semantic Scholar API:
- Direct Judgement Preference Optimization (2024)
- JudgeBench: A Benchmark for Evaluating LLM-based Judges (2024)
- RevisEval: Improving LLM-as-a-Judge via Response-Adapted References (2024)
- LLM-as-a-Judge & Reward Model: What They Can and Cannot Do (2024)
- Self-Boosting Large Language Models with Synthetic Preference Data (2024)
Models citing this paper: 6
Datasets citing this paper: 0