Model Description

This model is fine-tuned on reward modeling data and has undergone two stages of training: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). The released checkpoint is therefore a post-DPO model optimized for reasoning and text generation tasks. Conversations use a three-role chat format in which a reason turn sits between the user prompt and the final assistant response:

chat_message = [
  {"role": "user", "content": ...},
  {"role": "reason", "content": ...},
  {"role": "assistant", "content": ...},
]
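
A minimal inference sketch with the Hugging Face transformers library is shown below. The repository name is taken from the model tree at the end of this card; the example prompt is hypothetical, and it is an assumption that the bundled chat template accepts the custom "reason" role, so treat this as illustrative rather than documented usage.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jiulaikankan/Qwen2.5-14B-ReasonGenRM"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical prompt; the exact prompt format expected by the model is not
# documented in this card. Assumes the model generates the reason turn and the
# assistant turn itself when given only the user message.
chat_message = [
    {"role": "user", "content": "Compare the two candidate answers and explain which is better."},
]
input_ids = tokenizer.apply_chat_template(
    chat_message, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))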

Intended Use

While this model is specifically designed for reward modeling tasks, it also adapts to general-purpose tasks, showing a reasonable degree of correctness and reliability across various applications.

Limitations

  • The model’s performance may vary depending on the domain and specificity of the input.
  • It may inherit biases present in the training data.

Code and Resources

The code and additional resources for this model are available on GitHub.

Model size: 14.8B params (BF16, Safetensors)

Model tree for jiulaikankan/Qwen2.5-14B-ReasonGenRM

Base model: Qwen/Qwen2.5-14B (this model is a fine-tune of it)