distily_bench_gpt2_optim

This student model was distilled from the teacher model gpt2 on an unspecified dataset.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

  • eval_enwikippl: 524.7870
  • eval_frwikippl: 3705.5625
  • eval_zhwikippl: 6035.2861
  • eval_loss: 2370.7361
  • eval_runtime: 21.6322
  • eval_samples_per_second: 46.227
  • eval_steps_per_second: 11.557
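
The enwikippl, frwikippl, and zhwikippl figures are perplexities on English, French, and Chinese Wikipedia evaluation text, respectively. Below is a minimal sketch of how such a perplexity can be computed with the Transformers API; the repository id and the evaluation text are assumptions, and the exact evaluation pipeline used by Distily may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The repository id is an assumption; substitute the actual student checkpoint.
model_id = "lapp0/distily_bench_gpt2_optim"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Placeholder text; the reported eval_enwikippl uses an English Wikipedia split.
text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean token-level cross-entropy.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {torch.exp(loss).item():.4f}")
```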

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: LinearObjective(logits_weight=1, logits_loss_fn=kl_divergence_loss, activations_weight=10, activations_loss_fn=kl_divergence_loss, attentions_weight=0, attentions_loss_fn=mse_loss)
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
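
The distillation_objective above describes a linear combination of per-component losses: KL divergence on logits (weight 1), KL divergence on hidden activations (weight 10), and MSE on attentions (weight 0, i.e. effectively disabled). The sketch below illustrates such a combination in plain PyTorch; it is an assumption about how the terms are combined, not the Distily implementation.

```python
import torch.nn.functional as F

def kl_divergence_loss(student, teacher, temperature=1.0):
    # KL(teacher || student) over the last dimension.
    log_p_student = F.log_softmax(student / temperature, dim=-1)
    p_teacher = F.softmax(teacher / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def linear_objective(student_out, teacher_out,
                     logits_weight=1.0, activations_weight=10.0, attentions_weight=0.0):
    # Weighted sum of the three distillation terms (hypothetical combination).
    loss = logits_weight * kl_divergence_loss(student_out.logits, teacher_out.logits)
    for s_h, t_h in zip(student_out.hidden_states, teacher_out.hidden_states):
        loss = loss + activations_weight * kl_divergence_loss(s_h, t_h)
    if attentions_weight:  # weight is 0 in this run, so the MSE term is skipped
        for s_a, t_a in zip(student_out.attentions, teacher_out.attentions):
            loss = loss + attentions_weight * F.mse_loss(s_a, t_a)
    return loss
```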

Resource Usage

Peak GPU Memory: 4.5067 GB
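
Peak figures of this kind are typically taken from PyTorch's CUDA allocator statistics. A minimal sketch of how such a value could be measured (an assumption about the measurement method, not Distily's code):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the distillation / training loop here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"Peak GPU Memory: {peak_gb:.4f} GB")
```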

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 30.2385 | 57.2728 | | | | | 18.1772 |
| 0 | 0 | 55339.3672 | 57682.5742 | 31197.1836 | 21.4398 | 46.642 | 11.661 | 57080.2930 |
| 500 | 0.0808 | 1545.6934 | 7685.4297 | 3209.9360 | 21.4847 | 46.545 | 11.636 | 63830.4023 |
| 1000 | 0.1616 | 1108.6847 | 5659.8701 | 2933.1360 | 21.4559 | 46.607 | 11.652 | 31166.1797 |
| 1500 | 0.2424 | 913.3565 | 4893.8623 | 2798.0161 | 21.5956 | 46.306 | 11.576 | 23215.4258 |
| 2000 | 0.3232 | 813.5310 | 4763.6436 | 2700.0161 | 21.635 | 46.221 | 11.555 | 22568.9238 |
| 2500 | 0.4040 | 747.3608 | 4565.6851 | 2631.0720 | 21.5442 | 46.416 | 11.604 | 18090.1602 |
| 3000 | 0.4848 | 711.6094 | 4255.0127 | 2579.2639 | 21.7116 | 46.058 | 11.515 | 16199.8096 |
| 3500 | 0.5657 | 666.4665 | 4117.3369 | 2530.9441 | 21.5886 | 46.321 | 11.58 | 16435.1426 |
| 4000 | 0.6465 | 638.0192 | 4058.8262 | 2500.0801 | 21.4712 | 46.574 | 11.643 | 16069.4648 |
| 4500 | 0.7273 | 597.0923 | 4013.0125 | 2459.4241 | 21.7093 | 46.063 | 11.516 | 12965.0762 |
| 5000 | 0.8081 | 567.6912 | 3822.9963 | 2424.4800 | 21.5309 | 46.445 | 11.611 | 10275.5850 |
| 5500 | 0.8889 | 548.5159 | 3864.8674 | 2399.5359 | 21.6408 | 46.209 | 11.552 | 8114.6914 |
| 6000 | 0.9697 | 539.3817 | 3793.8606 | 2379.3601 | 21.5636 | 46.374 | 11.594 | 6467.9736 |
| 6187 | 0.9999 | 524.7870 | 3705.5625 | 2370.7361 | 21.6322 | 46.227 | 11.557 | 6035.2861 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • PyTorch 2.3.0
  • Datasets 2.20.0

Model size: 124M parameters (BF16, Safetensors)
