distily_bench_obj_cross_v2.11_gpt2

This student model was distilled from the teacher model gpt2; the training dataset is unspecified.

The Distily library was used for this distillation.
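Because the student keeps the standard GPT-2 architecture, it loads like any other Hugging Face checkpoint. A minimal usage sketch (the repo id is taken from this model page; the prompt and generation settings are illustrative only):

```python
# Minimal sketch: load the distilled student and generate text.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_obj_cross_v2.11_gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Distillation compresses a teacher model", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```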

It achieves the following results on the evaluation set:

  • eval_enwikippl: 840.1149
  • eval_frwikippl: 528.4605
  • eval_zhwikippl: 126.6330
  • eval_tinystoriesppl: 1037.4924
  • eval_loss: 0.5100
  • eval_runtime: 21.5094
  • eval_samples_per_second: 46.491
  • eval_steps_per_second: 11.623
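The eval_*ppl values are perplexities on the respective evaluation sets (English, French, and Chinese Wikipedia plus TinyStories, judging by the metric names). As a rough illustration of how such a causal-LM perplexity can be computed with transformers (a generic recipe over placeholder texts, not Distily's exact evaluation code):

```python
# Hedged sketch: standard causal-LM perplexity over a small corpus.
# `texts` is a placeholder; the actual evaluation sets are not bundled here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_obj_cross_v2.11_gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

texts = ["First evaluation document ...", "Second evaluation document ..."]
nll_sum, token_count = 0.0, 0
with torch.no_grad():
    for text in texts:
        enc = tokenizer(text, return_tensors="pt")
        # labels == input_ids makes the model return the mean token NLL
        out = model(**enc, labels=enc["input_ids"])
        n_targets = enc["input_ids"].numel() - 1  # shifted next-token targets
        nll_sum += out.loss.item() * n_targets
        token_count += n_targets

print("perplexity:", torch.exp(torch.tensor(nll_sum / token_count)).item())
```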

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None)) (a minimal sketch of this objective follows the list)
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 1
  • eval_batch_size: 4
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1.0
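The distillation_objective above places all of its weight on a KL-divergence loss between student and teacher logits; the hidden-state (hs) and attention (attn) components are disabled with weight 0. A minimal sketch of such a logit-level KL loss (an illustration of the stated objective, not Distily's actual implementation):

```python
# Hedged sketch of a forward-KL logit-distillation loss, mirroring the
# objective above (weight 1 on logits, 0 on hidden states / attentions).
import torch
import torch.nn.functional as F

def logits_kl_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor) -> torch.Tensor:
    """Mean KL(teacher || student) per token position."""
    # Flatten (batch, seq, vocab) -> (batch * seq, vocab) so that
    # reduction="batchmean" averages over all token positions.
    s = F.log_softmax(student_logits, dim=-1).flatten(0, -2)
    t = F.log_softmax(teacher_logits, dim=-1).flatten(0, -2)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")

# Training step (teacher frozen, student trainable):
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# loss = 1.0 * logits_kl_loss(student(input_ids).logits, teacher_logits)
```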

Resource Usage

Peak GPU Memory: 3.9285 GB
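The card does not say how this figure was obtained; one common way to capture peak allocation in PyTorch looks like the sketch below (an assumption about the measurement method, not a detail confirmed by the card):

```python
# Hedged sketch: torch.cuda peak-memory accounting around a training run.
import torch

torch.cuda.reset_peak_memory_stats()
# ... run training ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU Memory: {peak_gb:.4f} GB")
```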

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 270.2348 | 76.8142 | | | | | 671.1238 | 22.8030 |
| 0 | 0 | 120078.375 | 1867851235328.0 | 19.4492 | 21.0652 | 47.472 | 11.868 | 72.8770 | 4013754155008.0 |
| 5000 | 0.0505 | 1216.0441 | 888.1107 | 0.7144 | 21.4135 | 46.7 | 11.675 | 1267.6812 | 332.8297 |
| 10000 | 0.1010 | 1162.2788 | 799.4963 | 0.6619 | 21.4269 | 46.67 | 11.668 | 1249.7319 | 438.5025 |
| 15000 | 0.1515 | 980.3101 | 668.6794 | 0.6395 | 21.4739 | 46.568 | 11.642 | 1056.4025 | 425.3380 |
| 20000 | 0.2020 | 1064.2865 | 759.8051 | 0.6318 | 21.4643 | 46.589 | 11.647 | 1151.2905 | 311.5830 |
| 25000 | 0.2525 | 916.0289 | 621.8902 | 0.5662 | 21.1368 | 47.311 | 11.828 | 1071.6635 | 190.3806 |
| 30000 | 0.3030 | 891.1293 | 582.2575 | 0.5445 | 21.4338 | 46.655 | 11.664 | 1072.1951 | 208.7082 |
| 35000 | 0.3535 | 886.6196 | 544.0957 | 0.5381 | 21.5335 | 46.439 | 11.61 | 1057.8008 | 142.8915 |
| 40000 | 0.4040 | 880.1868 | 549.4098 | 0.5349 | 21.4687 | 46.58 | 11.645 | 1076.1021 | 142.8439 |
| 45000 | 0.4545 | 868.9573 | 564.4311 | 0.5323 | 21.4349 | 46.653 | 11.663 | 1042.4788 | 161.4311 |
| 50000 | 0.5051 | 877.1919 | 541.3246 | 0.5320 | 21.548 | 46.408 | 11.602 | 1058.0631 | 167.7873 |
| 55000 | 0.5556 | 869.4625 | 543.6743 | 0.5313 | 21.4821 | 46.55 | 11.638 | 1043.7725 | 163.6863 |
| 60000 | 0.6061 | 872.2788 | 553.3121 | 0.5305 | 21.4316 | 46.66 | 11.665 | 1068.5228 | 141.9700 |
| 65000 | 0.6566 | 833.5512 | 524.0497 | 0.5156 | 21.1637 | 47.251 | 11.813 | 1028.6963 | 137.2677 |
| 70000 | 0.7071 | 837.5645 | 523.4596 | 0.5133 | 21.4101 | 46.707 | 11.677 | 1031.1652 | 124.3812 |
| 75000 | 0.7576 | 847.7309 | 523.0175 | 0.5129 | 21.1745 | 47.227 | 11.807 | 1047.8357 | 130.6221 |
| 80000 | 0.8081 | 843.6693 | 534.2609 | 0.5125 | 21.388 | 46.755 | 11.689 | 1040.4556 | 125.4979 |
| 85000 | 0.8586 | 843.2120 | 524.1607 | 0.5106 | 21.4851 | 46.544 | 11.636 | 1042.5220 | 126.1609 |
| 90000 | 0.9091 | 842.1672 | 529.2425 | 0.5101 | 21.4494 | 46.621 | 11.655 | 1040.6277 | 126.7345 |
| 95000 | 0.9596 | 838.0835 | 528.3859 | 0.5099 | 21.1216 | 47.345 | 11.836 | 1034.5377 | 126.5655 |
| 99000 | 1.0 | 840.1149 | 528.4605 | 0.5100 | 21.5094 | 46.491 | 11.623 | 1037.4924 | 126.6330 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.21.0

Model size: 124M parameters, stored as BF16 safetensors.