|
--- |
|
model-index: |
|
- name: Junrulu/Reproduced-tulu2-dpo-13b |
|
results: [] |
|
datasets: |
|
- HuggingFaceH4/ultrafeedback_binarized |
|
- Junrulu/Reproduced-tulu2-test-sets |
|
language: |
|
- en |
|
base_model: allenai/tulu-2-13b |
|
--- |
|
|
|
# Model Card for Reproduced Tulu2 DPO 13B |
|
|
|
This repository provides a reproduction of Tulu2-DPO-13B, fine-tuned from [Tulu2-13B](https://huggingface.co./allenai/tulu-2-13b) on [Ultrafeedback](https://huggingface.co./datasets/HuggingFaceH4/ultrafeedback_binarized). Accordingly, we follow all licenses mentioned in the Tulu2 work. See our code for more details: https://github.com/LuJunru/LLM_Finetune/tree/DPO, which is built with [TRL](https://github.com/huggingface/trl/tree/main).
|
|
|
## Performance |
|
|
|
| Model | Size | Alignment | MT-Bench (score) | AlpacaEval 2.0 (win rate %) | |
|
|-------------|-----|----|---------------|--------------| |
|
| **Tulu-v2-13b** 🐪 | **13B** | **SFT** | **5.79** | **2.61** | |
|
| **Tulu-v2-dpo-13b** 🐪 | **13B** | **DPO** | **6.06** | **6.96** | |
|
| **Reproduced-tulu2-dpo-13b** | **13B** | **DPO** | **6.27** | **6.71** | |
|
|
|
## Input Format |
|
|
|
The model is trained to use the following format (note the newlines): |
|
``` |
|
<|user|> |
|
Your message here! |
|
<|assistant|> |
|
``` |
|
|
|
For best results, format all inputs in this manner. **Make sure to include a newline after `<|assistant|>`; it can affect generation quality quite a bit.** Note: if you fine-tune with this chat template, evaluate and test with the same template. Conversely, if you do not plan to use the template at test time, fine-tune without it. Any mismatch of the chat template between training and testing noticeably degrades final performance.
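
For illustration, here is a minimal inference sketch with 🤗 Transformers that applies the format above verbatim. The model id is this repository's Hub id from the front matter; dtype, device placement, and generation settings are illustrative assumptions, not prescribed values.

```python
# Minimal usage sketch (assumes torch and transformers are installed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Junrulu/Reproduced-tulu2-dpo-13b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build the prompt exactly as documented above, including the trailing
# newline after <|assistant|>.
prompt = "<|user|>\nYour message here!\n<|assistant|>\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```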
|
|
|
## Training hyperparameters |
|
|
|
The following hyperparameters were used during DPO training: |
|
- DPO beta: 0.1 |
|
- learning_rate: 1e-6 * sqrt(Num of Nodes) |
|
- total_train_batch_size: 128 * Num of Nodes |
|
- optimizer: AdamW with beta1 0.9, beta2 0.999 and epsilon 1e-8 |
|
- lr_scheduler_type: linear |
|
- lr_scheduler_warmup_ratio: 0.1 |
|
- weight_decay: 0.0
|
- num_epochs: 3.0 |
|
- The input format above is applied to every training sample (see the sketch below)
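
The exact training script lives in the linked repository; the snippet below is only a sketch of how these hyperparameters map onto TRL's `DPOTrainer`. It assumes an older TRL release whose `DPOTrainer` still accepts `beta` and `tokenizer` directly (newer releases move these into `DPOConfig`), and a single node; batch-size and learning-rate scaling with the node count follow the list above.

```python
# Hypothetical single-node sketch of the DPO configuration above, not the exact script.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

num_nodes = 1  # learning rate and total batch size scale with the node count

tokenizer = AutoTokenizer.from_pretrained("allenai/tulu-2-13b")
model = AutoModelForCausalLM.from_pretrained("allenai/tulu-2-13b")

# NOTE: DPOTrainer expects string columns `prompt`, `chosen`, and `rejected`;
# the raw dataset stores message lists, so map them through the input format
# above before training.
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

training_args = TrainingArguments(
    output_dir="reproduced-tulu2-dpo-13b",
    learning_rate=1e-6 * num_nodes ** 0.5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=128,  # choose so the global batch is 128 * num_nodes
    num_train_epochs=3.0,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    bf16=True,
)

trainer = DPOTrainer(
    model,
    ref_model=None,  # with None, TRL builds the frozen reference model internally
    args=training_args,
    beta=0.1,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```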