This is an outcome-supervised reward model (ORM) trained on Deepseek-generated data from the project RLHFlow/RLHF-Reward-Modeling.

The model is trained from meta-llama/Llama-3.1-8B-Instruct on RLHFlow/Deepseek-ORM-Data for 1 epoch. We use a global batch size of 32 and a learning rate of 2e-6; samples are packed and then split into chunks of 8192 tokens. See more training details at https://github.com/RLHFlow/Online-RLHF/blob/main/math/llama-3.1-prm.yaml .
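The packing step described above can be illustrated with a minimal sketch. It assumes that packing means concatenating the tokenized samples into one stream and slicing that stream into fixed-length windows; the function name `pack_into_chunks` is hypothetical, and details such as separator tokens or how a short final chunk is handled are not stated in this card, so the linked config remains the reference.

```python
from typing import Iterable, List


def pack_into_chunks(
    tokenized_samples: Iterable[List[int]], chunk_len: int = 8192
) -> List[List[int]]:
    """Concatenate tokenized samples and split the stream into chunks.

    Assumption: the final chunk is kept even if it is shorter than
    chunk_len. The actual training code may pad or drop it instead.
    """
    stream: List[int] = []
    for ids in tokenized_samples:
        stream.extend(ids)
    # Slice the flat token stream into consecutive windows of chunk_len.
    return [stream[i:i + chunk_len] for i in range(0, len(stream), chunk_len)]


# Toy usage with a tiny chunk length so the result is easy to inspect.
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = pack_into_chunks(samples, chunk_len=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```

With `chunk_len=8192`, each chunk would correspond to one training sequence of the length quoted above.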

Best-of-N (BoN) evaluation results for the Mistral generator:

| Model | Method | GSM8K | MATH |
|---|---|---|---|
| Mistral-7B | Pass@1 | 77.9 | 28.4 |
| Mistral-7B | Majority Voting@1024 | 84.2 | 36.8 |
| Mistral-7B | Mistral-ORM@1024 | 90.1 | 43.6 |
| Mistral-7B | Mistral-PRM@1024 | 92.4 | 46.3 |

Results with inference sampling scaled to N=1024 for the Deepseek generator:

| Model | Method | GSM8K | MATH |
|---|---|---|---|
| Deepseek-7B | Pass@1 | 83.9 | 38.4 |
| Deepseek-7B | Majority Voting@1024 | 89.7 | 57.4 |
| Deepseek-7B | Deepseek-ORM@1024 | 93.4 | 52.4 |
| Deepseek-7B | Deepseek-PRM@1024 | 93.0 | 58.1 |
| Deepseek-7B | Mistral-ORM@1024 (OOD) | 90.3 | 54.9 |
| Deepseek-7B | Mistral-PRM@1024 (OOD) | 91.9 | 56.9 |
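
The selection methods compared above can be sketched as follows. This is a minimal illustration, not the project's evaluation code: it assumes each sampled solution has already been reduced to a final answer string and a single scalar reward, and the names `majority_vote` and `best_of_n` as well as the example scores are hypothetical. How step-level PRM scores are aggregated into one reward per solution is not specified in this card.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer among the N samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: List[Tuple[str, float]]) -> str:
    """Return the final answer of the candidate with the highest reward."""
    return max(candidates, key=lambda c: c[1])[0]


# Hypothetical (final answer, reward score) pairs for N = 5 samples.
candidates = [("42", 0.31), ("41", 0.92), ("42", 0.40), ("42", 0.35), ("40", 0.10)]

print(majority_vote([answer for answer, _ in candidates]))  # 42
print(best_of_n(candidates))  # 41
```

The toy input shows how the two methods can disagree: majority voting follows the most common answer, while best-of-N follows the single highest-scored sample.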

Usage

See https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/main/math for detailed examples.

Citation

The automatic annotation was proposed in the Math-shepherd paper:

@inproceedings{wang2024math,
  title={Math-shepherd: Verify and reinforce llms step-by-step without human annotations},
  author={Wang, Peiyi and Li, Lei and Shao, Zhihong and Xu, Runxin and Dai, Damai and Li, Yifei and Chen, Deli and Wu, Yu and Sui, Zhifang},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={9426--9439},
  year={2024}
}

If you find the training recipe useful, please consider citing it as follows:

@misc{xiong2024rlhflowmath,
  author = {Wei Xiong and Hanning Zhang and Nan Jiang and Tong Zhang},
  title = {An Implementation of Generative PRM},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/RLHFlow/RLHF-Reward-Modeling}}
}