arXiv:2405.01535

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Published on May 2
· Submitted by akhaliq on May 3
#2 Paper of the day

Abstract

Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluation. Existing open evaluator LMs, on the other hand, exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements. It is capable of processing both direct assessment and pairwise ranking formats together with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available at https://github.com/prometheus-eval/prometheus-eval.

Community

Really cool work 🔥 Would be great to upload the model or build a demo on the Hub!

Paper author

@AdinaY Thanks for your interest in our paper!

You can access the models here:
https://huggingface.co./prometheus-eval/prometheus-8x7b-v2.0
https://huggingface.co./prometheus-eval/prometheus-7b-v2.0

Here's the GitHub repo, where we've prepared (possibly) every feature you might need:
https://github.com/prometheus-eval/prometheus-eval
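For quick experimentation, the 7B model can be loaded with the standard Hugging Face transformers API. Below is a minimal, untested sketch; the prompt string is a placeholder, not the official Prometheus 2 grading template (the repo above ships the real templates and a dedicated evaluation package):

```python
# Minimal sketch: loading Prometheus 2 (7B) with the standard transformers API.
# The prompt string below is a placeholder; real evaluations should use the
# official grading templates from the prometheus-eval repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "prometheus-eval/prometheus-7b-v2.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "###Task Description: ..."  # placeholder; substitute an official template
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```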


There's a plain-English rewrite of this paper available here: https://www.aimodels.fyi/papers/arxiv/prometheus-2-open-source-language-model-specialized

Hi there @seungone et al.! Congrats on the paper and the release 🎉

I was just wondering whether you experimented with multi-turn settings, e.g. critiquing the last assistant response(s) while using a whole conversation as input instead of a single instruction.

Also, since responses to a given instruction can be conditioned on the system prompt, did you consider adding a system prompt to the template, or did you run any ablations on that?

Thanks in advance!

Paper author

Hey @alvarobartt , thanks for your interest!

We ran experiments using MT-Bench, which is a multi-turn chat benchmark.
All you have to do is place the whole interaction in the {instruction} placeholder and the latest response in the {response} placeholder of the template.

We also appended the system prompt to the {instruction} placeholder. Please let us know your experience after using Prometheus 2 :)
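If it helps, here's a rough sketch of that mapping in plain Python. The TEMPLATE string, system prompt, and conversation are made-up stand-ins, not the official Prometheus 2 prompt (see the repo above for the real templates); the point is only how a multi-turn conversation and system prompt fold into the {instruction} and {response} placeholders:

```python
# Sketch: folding a multi-turn conversation into a direct-assessment template.
# TEMPLATE is a stand-in, not the official Prometheus 2 prompt; get the real
# templates from the prometheus-eval repo.
TEMPLATE = (
    "###Instruction:\n{instruction}\n\n"
    "###Response to evaluate:\n{response}\n"
)

system_prompt = "You are a concise assistant."  # hypothetical example data
conversation = [
    ("user", "What is the capital of France?"),
    ("assistant", "Paris."),
    ("user", "And roughly how many people live there?"),
]
latest_response = "About 2.1 million within the city proper."

# Per the authors: the system prompt and the whole interaction go into
# {instruction}; only the latest assistant turn goes into {response}.
history = "\n".join(f"{role}: {text}" for role, text in conversation)
instruction = f"[System] {system_prompt}\n{history}"

prompt = TEMPLATE.format(instruction=instruction, response=latest_response)
print(prompt)
```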


Models citing this paper: 20
Datasets citing this paper: 3
Spaces citing this paper: 9
Collections including this paper: 53