Starling-RM-7B-alpha

  • Developed by: Banghua Zhu * , Evan Frick * , Tianhao Wu * , Hanlin Zhu and Jiantao Jiao.
  • Model type: Language Model finetuned with RLHF / RLAIF
  • License: Non commercial license
  • Finetuned from model: Openchat 3.5 (based on Mistral-7B-v0.1)

We introduce Starling-7B, an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). The model harnesses the power of our new GPT-4 labeled ranking dataset, berkeley-nest/Nectar, and our new reward training and policy tuning pipeline. Starling-7B-alpha scores 8.09 in MT Bench with GPT-4 as a judge, outperforming every model to date on MT-Bench except for OpenAI's GPT-4 and GPT-4 Turbo. We release the ranking dataset Nectar, the reward model Starling-RM-7B-alpha and the language model Starling-LM-7B-alpha on HuggingFace, and an online demo in LMSYS Chatbot Arena. Stay tuned for our forthcoming code and paper, which will provide more details on the whole process.

Starling-LM-7B-alpha is a language model trained from Openchat 3.5 with reward model berkeley-nest/Starling-RM-7B-alpha and policy optimization method advantage-induced policy alignment (APA). The evaluation results are listed below.

Model Tuning Method MT Bench AlpacaEval MMLU
GPT-4-Turbo ? 9.32 97.70
GPT-4 SFT + PPO 8.99 95.28 86.4
Starling-7B C-RLFT + APA 8.09 91.99 63.9
Claude-2 ? 8.06 91.36 78.5
GPT-3.5-Turbo ? 7.94 89.37 70
Claude-1 ? 7.9 88.39 77
Tulu-2-dpo-70b SFT + DPO 7.89 95.1
Openchat-3.5 C-RLFT 7.81 88.51 64.3
Zephyr-7B-beta SFT + DPO 7.34 90.60 61.4
Llama-2-70b-chat-hf SFT + PPO 6.86 92.66 63
Neural-chat-7b-v3-1 SFT + DPO 6.84 84.53 62.4
Tulu-2-dpo-7b SFT + DPO 6.29 85.1

For more detailed discussions, please check out our blog post, and stay tuned for our upcoming code and paper!

Uses

Our model follows the exact chat template and usage as Openchat 3.5. Please refer to their model card for more details. In addition, our model is hosted on LMSYS Chatbot Arena for free test.

License

The dataset, model and online demo is a research preview intended for non-commercial use only, subject to the data distillation License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation.

Acknowledgment

We would like to thank Wei-Lin Chiang from Berkeley for detailed feedback of the blog and the projects. We would like to thank the LMSYS Organization for their support of lmsys-chat-1M dataset, evaluation and online demo. We would like to thank the open source community for their efforts in providing the datasets and base models we used to develope the project, including but not limited to Anthropic, Llama, Mistral, Hugging Face H4, LMSYS, OpenChat, OpenBMB, Flan and ShareGPT.

Citation

@misc{starling2023,
    title = {Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF},
    url = {},
    author = {Zhu, Banghua and Frick, Evan and Wu, Tianhao and Zhu, Hanlin and Jiao, Jiantao},
    month = {November},
    year = {2023}
}
Downloads last month
11
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Dataset used to train LoneStriker/Starling-LM-7B-alpha-4.0bpw-h6-exl2