ReasonEval-34B Model Card

Model Description

ReasonEval-34B is a 34B-parameter decoder-only language model fine-tuned from llemma_34b. Given a mathematical problem and its solution, ReasonEval-34B assesses the problem-solving process step by step from the following perspectives:

Validity: The step contains no mistakes in calculation and logic.
Redundancy: The step lacks utility in solving the problem but is still valid.
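
As a minimal sketch of how these two perspectives could be turned into per-step scores, assume the model emits, for each step, a probability distribution over three classes: positive (valid and useful), neutral (valid but redundant), and negative (invalid). This class layout and the names `StepScores` and `step_scores` are assumptions made for illustration and are not specified in this card; the GitHub repository holds the actual scoring code.

```python
from typing import NamedTuple


class StepScores(NamedTuple):
    validity: float
    redundancy: float


def step_scores(p_positive: float, p_neutral: float, p_negative: float) -> StepScores:
    """Map assumed class probabilities for one step to validity/redundancy scores."""
    total = p_positive + p_neutral + p_negative
    if abs(total - 1.0) > 1e-6:
        raise ValueError("class probabilities must sum to 1")
    # Assumption: a redundant (neutral) step is still valid, so validity
    # covers both the positive and the neutral class.
    return StepScores(validity=p_positive + p_neutral, redundancy=p_neutral)


scores = step_scores(p_positive=0.7, p_neutral=0.2, p_negative=0.1)
print(scores)
```

Under this assumed layout, a step with low validity is likely to contain a mistake, while a step with high redundancy is likely correct but unnecessary.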

With ReasonEval, you can

📏 quantify the quality of reasoning steps without relying on human annotators or closed-source models.
🤖 find potentially invalid or redundant steps in solutions, even when the final answer is correct.
🛠️ select high-quality training data for downstream tasks (e.g., fine-tuning).
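
The data-selection use case could look roughly like the sketch below. It assumes each solution has already been scored step by step, with one validity and one redundancy value per step. The function name `keep_solution`, the min/max aggregation, and both thresholds are hypothetical choices for illustration, not values recommended by this card.

```python
from typing import Dict, List


def keep_solution(
    steps: List[Dict[str, float]],
    min_validity: float = 0.5,    # hypothetical threshold
    max_redundancy: float = 0.5,  # hypothetical threshold
) -> bool:
    """Decide whether a scored solution is kept as training data."""
    if not steps:
        return False
    # Assumption: a solution is only as good as its weakest step, so take
    # the lowest validity and the highest redundancy across steps.
    worst_validity = min(s["validity"] for s in steps)
    worst_redundancy = max(s["redundancy"] for s in steps)
    return worst_validity >= min_validity and worst_redundancy <= max_redundancy


candidates = {
    "solution_a": [{"validity": 0.9, "redundancy": 0.1}, {"validity": 0.8, "redundancy": 0.2}],
    "solution_b": [{"validity": 0.9, "redundancy": 0.1}, {"validity": 0.3, "redundancy": 0.1}],
}
selected = [name for name, steps in candidates.items() if keep_solution(steps)]
print(selected)
```

Here `solution_b` is dropped because one of its steps falls below the validity threshold, even though its other step scores well.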

Model Details

Model type: ReasonEval-34B's architecture is identical to llemma_34b, except that the classification head for next-token prediction is replaced with a classification head that outputs the probability of each class of reasoning step.
Language(s): English
Paper: Evaluating Mathematical Reasoning Beyond Accuracy
Github: https://github.com/GAIR-NLP/ReasonEval
Finetuned from model: https://huggingface.co/EleutherAI/llemma_34b
Fine-tuning Data: PRM800K
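
To illustrate the head replacement described under Model type, the sketch below shows a linear classification head followed by a softmax, written in plain Python. It assumes three output classes and a single hidden-state vector per reasoning step; the function names, the tiny dimensions, and the weights are all made up for illustration and do not reflect the released checkpoint.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_step(hidden: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Project one step's hidden state to class logits, then to probabilities.

    `weight` has one row per class (assumed: 3 classes), replacing the
    vocabulary-sized projection used for next-token prediction.
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weight, bias)]
    return softmax(logits)


# Toy example: 2-dimensional hidden state, 3 classes.
hidden_state = [1.0, 0.0]
head_weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
head_bias = [0.0, 0.0, 0.0]
probs = classify_step(hidden_state, head_weight, head_bias)
print(probs)
```

The only structural difference from a language-modeling head in this sketch is the output width: a handful of step classes instead of the full vocabulary.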

For detailed instructions on how to use the ReasonEval-34B model, visit our GitHub repository at https://github.com/GAIR-NLP/ReasonEval.

How to Cite

@article{xia2024evaluating,
        title={Evaluating Mathematical Reasoning Beyond Accuracy}, 
        author={Xia, Shijie and Li, Xuefeng and Liu, Yixin and Wu, Tongshuang and Liu, Pengfei},
        journal={arXiv preprint arXiv:2404.05692},
        year={2024},
}