HW2-dpo

This model is a fine-tuned version of openai-community/gpt2 on the piqa dataset. Its evaluation results at each checkpoint are listed in the results table at the end of this card.

Model description

More information needed

The hyperparameters used during training are not listed here. The following training and evaluation metrics were logged every 500 steps:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.7148	0.2758	500	0.6866	-2.4692	-2.6207	0.6073	0.1516	-120.4563	-117.0685	-93.4454	-93.3709
0.6747	0.5516	1000	0.6612	-4.3908	-4.7275	0.6495	0.3367	-141.5237	-136.2848	-91.8209	-91.7481
0.6681	0.8275	1500	0.6704	-5.5067	-5.9227	0.6439	0.4161	-153.4764	-147.4433	-87.9307	-87.9442
0.5393	1.1033	2000	0.7086	-6.7527	-7.3196	0.6501	0.5669	-167.4447	-159.9034	-87.8440	-87.9360
0.3132	1.3791	2500	0.7451	-9.5756	-10.2276	0.6520	0.6520	-196.5250	-188.1325	-86.2624	-86.5916
0.3077	1.6549	3000	0.7269	-9.8647	-10.6512	0.6514	0.7865	-200.7605	-191.0236	-89.7969	-90.1133
0.2954	1.9308	3500	0.6959	-9.1185	-9.9717	0.6725	0.8531	-193.9657	-183.5620	-87.1994	-87.4444
0.1295	2.2066	4000	0.8306	-13.7328	-14.8923	0.6650	1.1595	-243.1719	-229.7044	-74.6679	-75.0498
0.0665	2.4824	4500	0.8662	-14.5425	-15.8052	0.6600	1.2626	-252.3006	-237.8021	-75.1185	-75.5275
0.0606	2.7582	5000	0.8593	-13.9253	-15.1905	0.6644	1.2652	-246.1538	-231.6296	-73.5328	-73.9152
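
The table can be sanity-checked with a few lines of Python. The sketch below is not part of the original training code. It copies six of the columns by hand and assumes the metrics follow the usual DPO logging convention, in which Rewards/margins is the mean of chosen reward minus rejected reward. The names `ROWS`, `best_step`, and `max_margin_error` are hypothetical helpers introduced only for illustration.

```python
# Columns copied by hand from the results table:
# (step, training loss, validation loss,
#  rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (500, 0.7148, 0.6866, -2.4692, -2.6207, 0.1516),
    (1000, 0.6747, 0.6612, -4.3908, -4.7275, 0.3367),
    (1500, 0.6681, 0.6704, -5.5067, -5.9227, 0.4161),
    (2000, 0.5393, 0.7086, -6.7527, -7.3196, 0.5669),
    (2500, 0.3132, 0.7451, -9.5756, -10.2276, 0.6520),
    (3000, 0.3077, 0.7269, -9.8647, -10.6512, 0.7865),
    (3500, 0.2954, 0.6959, -9.1185, -9.9717, 0.8531),
    (4000, 0.1295, 0.8306, -13.7328, -14.8923, 1.1595),
    (4500, 0.0665, 0.8662, -14.5425, -15.8052, 1.2626),
    (5000, 0.0606, 0.8593, -13.9253, -15.1905, 1.2652),
]


def best_step(rows):
    """Return the step whose validation loss is lowest."""
    return min(rows, key=lambda r: r[2])[0]


def max_margin_error(rows):
    """Largest gap between the logged margin and (chosen - rejected).

    Assumption: the margin column is the mean of chosen minus rejected,
    so small differences come only from rounding to four decimals.
    """
    return max(abs((r[3] - r[4]) - r[5]) for r in rows)


if __name__ == "__main__":
    print("lowest validation loss at step", best_step(ROWS))
    print("max margin discrepancy:", round(max_margin_error(ROWS), 4))
```

On these numbers the lowest validation loss (0.6612) occurs at step 1000. Training loss keeps falling to 0.0606 by step 5000 while validation loss rises to 0.8593, which suggests the later checkpoints may be overfitting the preference data even though the reward margin keeps growing.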