---
license: apache-2.0
---

# EurusPRM-Stage1

## Links

- 📜 [Blog](https://curvy-check-498.notion.site/Process-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f)
- 🤗 [PRIME Collection](https://huggingface.co./PRIME-RL)
- 🤗 [Training Data](https://huggingface.co./datasets/PRIME-RL/EurusPRM-Stage2-Data)

## Introduction

EurusPRM-Stage1 is trained with **[Implicit PRM](https://arxiv.org/abs/2412.01981)**, which yields process rewards at no additional cost: it only requires training an outcome reward model (ORM) on the cheaper response-level labels. During inference, implicit process rewards are obtained with a forward pass by computing the log-likelihood ratio at each step.

The key ingredient of Implicit PRM is the reward representation, as demonstrated below: