PEFT · Safetensors · qwen2 · alignment-handbook · trl · dpo · Generated from Trainer
khongtrunght committed (verified) · Commit 5405756 · 1 Parent(s): 748a602

Model save

Files changed (4)
  1. README.md +20 -23
  2. all_results.json +6 -19
  3. train_results.json +6 -6
  4. trainer_state.json +0 -0
README.md CHANGED
@@ -1,12 +1,7 @@
  ---
  base_model: slm-research-vn/Qwen2-7B-Instruct-SPPO-Function-call-v2.1
- datasets:
- - slm-research-vn/dpo-format-function-calling-v2
- - slm-research-vn/dpo-format-glaive-code-assistant-v3-with-mistral-large-slm-iter4
- - argilla/dpo-mix-7k
  library_name: peft
  tags:
- - alignment-handbook
  - trl
  - dpo
  - generated_from_trainer
@@ -20,17 +15,17 @@ should probably proofread and complete it, then remove this comment. -->
 
  # Qwen2-7B-Instruct-SPPO-Function-call-v2.4
 
- This model is a fine-tuned version of [slm-research-vn/Qwen2-7B-Instruct-SPPO-Function-call-v2.1](https://huggingface.co/slm-research-vn/Qwen2-7B-Instruct-SPPO-Function-call-v2.1) on the slm-research-vn/dpo-format-function-calling-v2, the slm-research-vn/dpo-format-glaive-code-assistant-v3-with-mistral-large-slm-iter4 and the argilla/dpo-mix-7k datasets.
+ This model is a fine-tuned version of [slm-research-vn/Qwen2-7B-Instruct-SPPO-Function-call-v2.1](https://huggingface.co/slm-research-vn/Qwen2-7B-Instruct-SPPO-Function-call-v2.1) on the None dataset.
  It achieves the following results on the evaluation set:
- - Loss: 0.4345
- - Rewards/chosen: 1.3033
- - Rewards/rejected: 0.2776
- - Rewards/accuracies: 0.8185
- - Rewards/margins: 1.0258
- - Logps/rejected: -333.5228
- - Logps/chosen: -261.0424
- - Logits/rejected: -0.7224
- - Logits/chosen: -0.7089
+ - Loss: 0.3144
+ - Rewards/chosen: 1.9968
+ - Rewards/rejected: 0.2178
+ - Rewards/accuracies: 0.8844
+ - Rewards/margins: 1.7789
+ - Logps/rejected: -267.1377
+ - Logps/chosen: -202.5171
+ - Logits/rejected: -0.6136
+ - Logits/chosen: -0.6052
 
  ## Model description
 
@@ -49,7 +44,7 @@ More information needed
  ### Training hyperparameters
 
  The following hyperparameters were used during training:
- - learning_rate: 5e-07
+ - learning_rate: 1e-06
  - train_batch_size: 1
  - eval_batch_size: 1
  - seed: 42
@@ -67,13 +62,15 @@ The following hyperparameters were used during training:
 
  | Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
  |:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
- | 0.6782 | 0.1270 | 100 | 0.6611 | 0.1038 | 0.0272 | 0.8000 | 0.0766 | -338.5302 | -285.0340 | -0.7425 | -0.7284 |
- | 0.5811 | 0.2540 | 200 | 0.5409 | 0.5575 | 0.1395 | 0.8370 | 0.4180 | -336.2845 | -275.9589 | -0.7306 | -0.6945 |
- | 0.5484 | 0.3811 | 300 | 0.4777 | 0.9393 | 0.2286 | 0.8000 | 0.7107 | -334.5019 | -268.3231 | -0.7283 | -0.7031 |
- | 0.4531 | 0.5081 | 400 | 0.4535 | 1.1283 | 0.2592 | 0.8296 | 0.8690 | -333.8891 | -264.5439 | -0.7170 | -0.6879 |
- | 0.4577 | 0.6351 | 500 | 0.4415 | 1.2504 | 0.2849 | 0.8148 | 0.9655 | -333.3753 | -262.1006 | -0.7146 | -0.6865 |
- | 0.4715 | 0.7621 | 600 | 0.4364 | 1.2963 | 0.2864 | 0.8148 | 1.0099 | -333.3469 | -261.1842 | -0.7175 | -0.6913 |
- | 0.4508 | 0.8892 | 700 | 0.4348 | 1.2990 | 0.2819 | 0.8222 | 1.0172 | -333.4369 | -261.1283 | -0.7185 | -0.6937 |
+ | 0.6534 | 0.1020 | 100 | 0.6139 | 0.2871 | 0.0961 | 0.7572 | 0.1911 | -269.5727 | -236.7095 | -0.6762 | -0.6844 |
+ | 0.4902 | 0.2041 | 200 | 0.4530 | 1.4421 | 0.5735 | 0.8064 | 0.8685 | -260.0234 | -213.6108 | -0.6513 | -0.6502 |
+ | 0.391 | 0.3061 | 300 | 0.3935 | 1.9109 | 0.6931 | 0.8382 | 1.2178 | -257.6317 | -204.2344 | -0.6321 | -0.6298 |
+ | 0.3497 | 0.4082 | 400 | 0.3633 | 1.9715 | 0.5740 | 0.8468 | 1.3975 | -260.0141 | -203.0221 | -0.6323 | -0.6313 |
+ | 0.3378 | 0.5102 | 500 | 0.3421 | 2.0346 | 0.4602 | 0.8699 | 1.5744 | -262.2907 | -201.7610 | -0.6197 | -0.6103 |
+ | 0.2904 | 0.6122 | 600 | 0.3287 | 1.9449 | 0.3083 | 0.8757 | 1.6366 | -265.3278 | -203.5543 | -0.6221 | -0.6159 |
+ | 0.3053 | 0.7143 | 700 | 0.3207 | 1.9933 | 0.2606 | 0.8902 | 1.7327 | -266.2818 | -202.5857 | -0.6162 | -0.6111 |
+ | 0.2655 | 0.8163 | 800 | 0.3158 | 1.9845 | 0.2262 | 0.8815 | 1.7583 | -266.9698 | -202.7614 | -0.6127 | -0.6026 |
+ | 0.2943 | 0.9184 | 900 | 0.3144 | 1.9968 | 0.2178 | 0.8844 | 1.7789 | -267.1377 | -202.5171 | -0.6136 | -0.6052 |
 
 
  ### Framework versions
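The Loss, Rewards/* and Logps/* columns above are the standard quantities TRL's DPOTrainer logs for DPO runs. As a minimal sketch (not code from this repository; the `beta=0.1` value and the function name are assumptions for illustration), this is how those metrics are typically derived from summed per-sequence log-probabilities of the policy and reference models:

```python
import torch
import torch.nn.functional as F

def dpo_eval_metrics(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit DPO rewards: beta * (log pi_theta(y|x) - log pi_ref(y|x)),
    # with log-probs summed over response tokens. beta=0.1 is an assumed
    # value, not read from this repo's training config.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = chosen_rewards - rejected_rewards

    # Sigmoid DPO loss on the reward margin.
    loss = -F.logsigmoid(margins).mean()

    return {
        "eval_loss": loss.item(),
        "eval_rewards/chosen": chosen_rewards.mean().item(),
        "eval_rewards/rejected": rejected_rewards.mean().item(),
        "eval_rewards/margins": margins.mean().item(),
        # Fraction of pairs where the chosen response gets the higher reward.
        "eval_rewards/accuracies": (margins > 0).float().mean().item(),
        "eval_logps/chosen": policy_chosen_logps.mean().item(),
        "eval_logps/rejected": policy_rejected_logps.mean().item(),
    }

# Example with dummy per-sequence log-probs (batch of 4 preference pairs).
metrics = dpo_eval_metrics(
    policy_chosen_logps=torch.tensor([-200.0, -210.0, -190.0, -205.0]),
    policy_rejected_logps=torch.tensor([-270.0, -260.0, -280.0, -265.0]),
    ref_chosen_logps=torch.tensor([-220.0, -215.0, -200.0, -210.0]),
    ref_rejected_logps=torch.tensor([-268.0, -262.0, -275.0, -266.0]),
)
print(metrics)
```

Under this bookkeeping, the rewards/margins column growing from 0.1911 at step 100 to 1.7789 at step 900 simply means the policy separates chosen from rejected responses more strongly over the epoch; this run also uses a learning rate of 1e-06, double the 5e-07 of the previous version.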
all_results.json CHANGED
@@ -1,22 +1,9 @@
  {
- "epoch": 0.9996824388694824,
- "eval_logits/chosen": -0.7089246511459351,
- "eval_logits/rejected": -0.7224333882331848,
- "eval_logps/chosen": -261.0423889160156,
- "eval_logps/rejected": -333.52276611328125,
- "eval_loss": 0.4344652593135834,
- "eval_rewards/accuracies": 0.8185185194015503,
- "eval_rewards/chosen": 1.3033398389816284,
- "eval_rewards/margins": 1.0257779359817505,
- "eval_rewards/rejected": 0.27756187319755554,
- "eval_runtime": 185.0205,
- "eval_samples": 2155,
- "eval_samples_per_second": 11.647,
- "eval_steps_per_second": 1.459,
+ "epoch": 1.0,
  "total_flos": 0.0,
- "train_loss": 0.5202696668753327,
- "train_runtime": 5902.1361,
- "train_samples": 25187,
- "train_samples_per_second": 4.267,
- "train_steps_per_second": 0.133
+ "train_loss": 0.39260031933687173,
+ "train_runtime": 7916.79,
+ "train_samples": 31353,
+ "train_samples_per_second": 3.96,
+ "train_steps_per_second": 0.124
  }
train_results.json CHANGED
@@ -1,9 +1,9 @@
  {
- "epoch": 0.9996824388694824,
+ "epoch": 1.0,
  "total_flos": 0.0,
- "train_loss": 0.5202696668753327,
- "train_runtime": 5902.1361,
- "train_samples": 25187,
- "train_samples_per_second": 4.267,
- "train_steps_per_second": 0.133
+ "train_loss": 0.39260031933687173,
+ "train_runtime": 7916.79,
+ "train_samples": 31353,
+ "train_samples_per_second": 3.96,
+ "train_steps_per_second": 0.124
  }
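As a quick consistency check on the new train_results.json (a throwaway sketch using only the values shown above), the reported throughput follows directly from the sample count and runtime:

```python
# Values copied from the new train_results.json above.
train_samples = 31353
train_runtime = 7916.79  # seconds

# 31353 / 7916.79 ~= 3.96, matching the reported train_samples_per_second.
print(round(train_samples / train_runtime, 2))
```

Similarly, 0.124 steps/s over 7916.79 s implies roughly 980 optimizer steps, which lines up with the last logged checkpoint in the README table (step 900 at epoch 0.9184).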
trainer_state.json CHANGED
The diff for this file is too large to render. See raw diff