HelpSteer2-Preference: Complementing Ratings with Preferences
Abstract
Reward models are critical for aligning models to follow instructions, and are typically trained following one of two popular paradigms: Bradley-Terry style or Regression style. However, there is a lack of evidence that either approach is better than the other when adequately matched for data. This is primarily because the two approaches require data collected in different (and incompatible) formats, meaning that adequately matched data is not available in existing public datasets. To tackle this problem, we release preference annotations (designed for Bradley-Terry training) to complement the existing ratings (designed for Regression-style training) in the HelpSteer2 dataset. To improve data interpretability, the preference annotations are accompanied by human-written justifications. Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. Based on insights derived from this comparison, we propose a novel approach to combine Bradley-Terry and Regression reward modeling. A Llama-3.1-70B-Instruct model tuned with this approach scores 94.1 on RewardBench, ranking first among more than 140 reward models as of 1 Oct 2024. We also demonstrate the effectiveness of this reward model at aligning models to follow instructions in RLHF. We open-source the dataset (CC-BY-4.0 license) at https://huggingface.co./datasets/nvidia/HelpSteer2 and openly release the trained reward model at https://huggingface.co./nvidia/Llama-3.1-Nemotron-70B-Reward
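For readers unfamiliar with the two paradigms, the sketch below contrasts the loss functions typically used for Bradley-Terry (pairwise preference) and Regression (rating) reward models. This is an illustrative sketch only, not the paper's implementation; the function and variable names are our own, and the paper's combined approach should be taken from the paper itself.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: maximize the log-probability that the
    chosen response receives a higher scalar reward than the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def regression_loss(predicted_rating: torch.Tensor,
                    human_rating: torch.Tensor) -> torch.Tensor:
    """Regression-style loss: fit the reward head directly to human-annotated
    ratings, such as the HelpSteer2 helpfulness scores."""
    return F.mse_loss(predicted_rating, human_rating)
```

The first requires paired (preferred vs. rejected) responses, the second per-response scalar ratings, which is why the two data formats are not interchangeable without additional annotation.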
Community
Dataset (CC-BY-4.0 Licensed): https://huggingface.co./datasets/nvidia/HelpSteer2
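For convenience, here is a minimal sketch of loading the dataset with the Hugging Face `datasets` library. The split and column names should be checked against the dataset card, and the new preference annotations may be shipped as a separate configuration or data directory within the same repo:

```python
from datasets import load_dataset

# Download HelpSteer2 from the Hugging Face Hub
ds = load_dataset("nvidia/HelpSteer2")

print(ds)                         # inspect available splits
print(ds["train"].column_names)   # per-attribute rating columns (helpfulness, correctness, ...)
print(ds["train"][0])             # a single prompt/response example with its ratings
```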
Using this reward model for RLHF (specifically, REINFORCE), we were able to align a Llama-3.1-70B-Instruct model to reach an AlpacaEval 2 LC of 57.6, an Arena Hard score of 85.0, and a GPT-4-Turbo MT-Bench score of 8.98, all of which are known to be predictive of LMSys Chatbot Arena Elo.
As of 1 Oct 2024, this aligned model is #1 on all three automatic alignment benchmarks, edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.
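At a high level, REINFORCE uses the reward model's scalar score as the learning signal for the policy. The snippet below is a schematic, simplified sketch of one policy-gradient step (no KL penalty, batch-mean baseline); it is not the training code used for the released model.

```python
import torch

def reinforce_step(policy_logprobs: torch.Tensor,
                   rewards: torch.Tensor) -> torch.Tensor:
    """One simplified REINFORCE policy-gradient loss over sampled responses.

    policy_logprobs: summed token log-probs of each sampled response, shape (B,)
    rewards:         scalar reward-model scores for those responses, shape (B,)
    """
    # Use the batch-mean reward as a simple variance-reduction baseline.
    baseline = rewards.mean()
    advantages = rewards - baseline
    # Gradient ascent on E[advantage * log pi(response | prompt)].
    return -(advantages.detach() * policy_logprobs).mean()
```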
See details in the HelpSteer2-Preference paper at https://arxiv.org/abs/2410.01257 - as a preview, this model can correctly answer the question "How many r's are in strawberry?"
without specialized prompting or additional reasoning tokens:
A sweet question!
Let’s count the “R”s in “strawberry”:
1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y
There are **3 “R”s** in the word “strawberry”.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison (2024)
- Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness (2024)
- On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization (2024)
- From Lists to Emojis: How Format Bias Affects Model Alignment (2024)
- RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data (2024)