sft_dpo_fs

This model is a fine-tuned version of mistralai/Mistral-Nemo-Instruct-2407 on the heat_transfer_dpo dataset. It achieves the following results on the evaluation set:

Loss: 0.1535
Rewards/chosen: 17.2823
Rewards/rejected: 11.3004
Rewards/accuracies: 0.9610
Rewards/margins: 5.9819
Logps/chosen: -2.2063
Logps/rejected: -60.6033
Logits/chosen: 0.0035
Logits/rejected: -0.0076
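
As a quick consistency check on the figures above, the reported reward margin should equal the chosen reward minus the rejected reward. The snippet below is a minimal sketch of that arithmetic only; the function name `reward_margin` is hypothetical, and the tolerance is an assumption that allows for the four-decimal rounding in the reported values.

```python
# Evaluation metrics copied from the list above.
REWARDS_CHOSEN = 17.2823
REWARDS_REJECTED = 11.3004
REPORTED_MARGIN = 5.9819


def reward_margin(chosen: float, rejected: float) -> float:
    """Return the preference margin: chosen reward minus rejected reward."""
    return chosen - rejected


# Tolerance is an assumption covering rounding to four decimals.
computed = reward_margin(REWARDS_CHOSEN, REWARDS_REJECTED)
is_consistent = abs(computed - REPORTED_MARGIN) < 2e-4
print(f"computed margin = {computed:.4f}, consistent = {is_consistent}")
```

The same check applies to any row of the Training results table, for example step 1080 (17.2887 and 11.3001 against a reported margin of 5.9886).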

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 4
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
num_devices: 2
total_train_batch_size: 8
total_eval_batch_size: 8
optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
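
To make the batch-size and scheduler settings concrete, here is a minimal sketch of how the total batch size and a cosine schedule with linear warmup could be computed from the values listed. It is an illustration, not the training code. Two values are assumptions not stated in the list: `TOTAL_STEPS = 1125` is inferred from the Training results table in this card (step 1080 at epoch 0.96), and `GRAD_ACCUM_STEPS = 1` is inferred from 4 x 2 already equalling the total batch size of 8. Rounding the warmup length up with `math.ceil` is also an assumption.

```python
import math

# Values taken from the hyperparameter list above.
LEARNING_RATE = 5e-06
TRAIN_BATCH_SIZE = 4  # per device
NUM_DEVICES = 2
WARMUP_RATIO = 0.1

# Assumption: inferred from the results table (step 1080 at epoch 0.96).
TOTAL_STEPS = 1125
# Assumption: no gradient accumulation, since 4 * 2 already gives 8.
GRAD_ACCUM_STEPS = 1


def effective_batch_size() -> int:
    """Per-device batch size times device count times accumulation steps."""
    return TRAIN_BATCH_SIZE * NUM_DEVICES * GRAD_ACCUM_STEPS


def lr_at(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * WARMUP_RATIO)  # assumed rounding
    if step < warmup_steps:
        return LEARNING_RATE * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", effective_batch_size())
for s in (0, 60, 113, 600, 1125):
    print(f"step {s:>4}: lr = {lr_at(s):.3e}")
```

Under these assumptions the learning rate climbs linearly over roughly the first tenth of the epoch, peaks at 5e-06, and then decays along a half cosine.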

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/chosen	Logps/rejected	Logits/chosen	Logits/rejected
0.3835	0.0533	60	0.3287	17.1482	15.7095	0.9280	1.4386	-3.5472	-16.5118	-0.5827	-0.5914
0.2552	0.1067	120	0.1900	17.1335	13.7535	0.9320	3.3799	-3.6944	-36.0722	-0.2065	-0.2218
0.2362	0.16	180	0.2024	17.0614	11.9722	0.9510	5.0892	-4.4150	-53.8850	-0.1087	-0.1222
0.1781	0.2133	240	0.1546	17.0620	12.2862	0.9500	4.7758	-4.4089	-50.7448	-0.1243	-0.1381
0.265	0.2667	300	0.1536	17.2493	12.6444	0.9440	4.6050	-2.5355	-47.1637	-0.1744	-0.1856
0.1605	0.32	360	0.3194	17.3612	12.2655	0.9210	5.0958	-1.4165	-50.9525	-0.1062	-0.1173
0.2894	0.3733	420	0.1679	17.3116	12.2496	0.9450	5.0620	-1.9131	-51.1113	-0.0905	-0.1026
0.1149	0.4267	480	0.2951	17.0540	11.9844	0.9230	5.0696	-4.4890	-53.7628	-0.0770	-0.0883
0.0384	0.48	540	0.1739	17.2042	12.1334	0.9490	5.0708	-2.9873	-52.2731	-0.0512	-0.0612
0.4008	0.5333	600	0.1706	17.2853	11.6981	0.9470	5.5872	-2.1760	-56.6266	-0.0358	-0.0469
0.1678	0.5867	660	0.2050	17.2021	11.5656	0.9450	5.6365	-3.0082	-57.9516	-0.0160	-0.0270
0.2272	0.64	720	0.1402	17.3928	11.7696	0.9520	5.6233	-1.1005	-55.9117	-0.0229	-0.0322
0.1915	0.6933	780	0.2441	17.3947	11.7656	0.9320	5.6290	-1.0823	-55.9507	-0.0166	-0.0266
0.0635	0.7467	840	0.1689	17.3812	11.5343	0.9450	5.8469	-1.2169	-58.2643	-0.0111	-0.0217
0.1703	0.8	900	0.1400	17.3271	11.3817	0.9610	5.9455	-1.7577	-59.7906	0.0002	-0.0105
0.1138	0.8533	960	0.1441	17.3149	11.3432	0.9630	5.9718	-1.8795	-60.1756	0.0015	-0.0094
0.0513	0.9067	1020	0.1412	17.3211	11.3263	0.9610	5.9948	-1.8178	-60.3445	0.0045	-0.0065
0.1189	0.96	1080	0.1508	17.2887	11.3001	0.9610	5.9886	-2.1420	-60.6061	0.0074	-0.0036
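
The validation loss in the table fluctuates from one evaluation to the next, so it can help to locate the best checkpoints programmatically. The sketch below copies the step, validation loss, and reward accuracy columns from the table and picks the rows with the lowest loss and the highest accuracy. The variable names are hypothetical, and whether checkpoints were actually saved at these steps is not stated in this card.

```python
# (step, validation_loss, rewards_accuracy) copied from the table above.
EVAL_ROWS = [
    (60, 0.3287, 0.9280),
    (120, 0.1900, 0.9320),
    (180, 0.2024, 0.9510),
    (240, 0.1546, 0.9500),
    (300, 0.1536, 0.9440),
    (360, 0.3194, 0.9210),
    (420, 0.1679, 0.9450),
    (480, 0.2951, 0.9230),
    (540, 0.1739, 0.9490),
    (600, 0.1706, 0.9470),
    (660, 0.2050, 0.9450),
    (720, 0.1402, 0.9520),
    (780, 0.2441, 0.9320),
    (840, 0.1689, 0.9450),
    (900, 0.1400, 0.9610),
    (960, 0.1441, 0.9630),
    (1020, 0.1412, 0.9610),
    (1080, 0.1508, 0.9610),
]

# Row with the lowest validation loss.
best_by_loss = min(EVAL_ROWS, key=lambda row: row[1])
# Row with the highest reward accuracy.
best_by_accuracy = max(EVAL_ROWS, key=lambda row: row[2])

print("lowest validation loss:", best_by_loss)
print("highest reward accuracy:", best_by_accuracy)
```

The two criteria select neighbouring evaluations (steps 900 and 960), both in the last fifth of the epoch.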

Framework versions

PEFT 0.12.0
Transformers 4.46.0
Pytorch 2.4.0+cu121
Datasets 2.21.0
Tokenizers 0.20.1
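
When reproducing results, it can be useful to compare the local environment against the versions listed above. The following is a minimal sketch using only the standard library. The distribution names in `PINNED` (for example `torch` for Pytorch) are assumptions about how these packages are named on PyPI, and `compare_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above; distribution names are assumed.
PINNED = {
    "peft": "0.12.0",
    "transformers": "4.46.0",
    "torch": "2.4.0+cu121",
    "datasets": "2.21.0",
    "tokenizers": "0.20.1",
}


def compare_versions(pinned=PINNED):
    """Map each package to (expected, installed or None, exact match flag)."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed locally
        report[name] = (expected, installed, installed == expected)
    return report


for name, (expected, installed, matches) in compare_versions().items():
    status = "ok" if matches else "differs"
    print(f"{name}: expected {expected}, installed {installed} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; the check only flags where the local setup differs from the one recorded here.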

Model repository: Howard881010/heat_transfer_sft_dpo_fs
