# distily_bench_gpt2_activation_loss_b

This student model was distilled from the teacher model gpt2 on an unspecified dataset, using the Distily library.

It achieves the following results on the evaluation set (the `*wikippl` metrics are perplexities on English, French, and Chinese Wikipedia text; a loading-and-evaluation sketch follows the list):
- eval_enwikippl: 225.9773
- eval_frwikippl: 1391.1320
- eval_zhwikippl: 821.2236
- eval_loss: 19.6630
- eval_runtime: 17.2806
- eval_samples_per_second: 57.868
- eval_steps_per_second: 7.234
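As a minimal sketch of how such perplexity numbers can be reproduced with the Transformers API (the model ID is taken from this card; the evaluation text here is a placeholder, not the card's actual enwiki/frwiki/zhwiki evaluation data):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_gpt2_activation_loss_b"  # student checkpoint from this card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "The quick brown fox jumps over the lazy dog."  # placeholder sample
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels set, the causal LM returns the mean cross-entropy over tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

# Perplexity is the exponential of the mean cross-entropy.
print(f"perplexity = {torch.exp(loss).item():.2f}")
```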
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a sketch of the combined distillation objective follows the list):
- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=2.0, loss_fn=ce, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
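The objective above combines a KL divergence on logits (weight 1) with a cross-entropy-style loss on hidden states (weight 2.0); the attention component is disabled (weight 0). The following is a hedged sketch of how such a weighted objective could be computed; the function name `distillation_loss` and the exact form of the hidden-state `ce` term are assumptions for illustration, not Distily's actual implementation:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, logits_weight=1.0, hs_weight=2.0):
    """Sketch of a combined objective like the one above: KL on logits plus a
    hidden-state ("activation") term. Distily's actual LossComponent code may
    differ; this only illustrates the weighting. Both arguments are model
    outputs with .logits and .hidden_states (output_hidden_states=True)."""
    # Logits component: KL(teacher || student) over the vocabulary distribution.
    kl = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.log_softmax(teacher_out.logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    # Hidden-state component: soft cross-entropy between per-layer activations
    # (a hypothetical stand-in for Distily's `ce` hidden-state loss).
    hs = sum(
        -(F.softmax(t, dim=-1) * F.log_softmax(s, dim=-1)).sum(-1).mean()
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    )
    # The attn component has weight 0 in this run, so it is omitted here.
    return logits_weight * kl + hs_weight * hs
```

Training with the listed hyperparameters would then pair this loss with `torch.optim.Adam(student.parameters(), lr=4e-5, betas=(0.9, 0.999), eps=1e-8)` and a constant learning-rate schedule for one epoch.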
### Resource Usage

- Peak GPU Memory: 8.0903 GB

### Eval-Phase Metrics
| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 55429.6875 | 57698.8047 | 24.5150 | 17.2943 | 57.823 | 7.228 | 56988.9141 |
| 1000 | 0.0808 | 713.7677 | 4453.7666 | 20.3910 | 17.3531 | 57.627 | 7.203 | 17866.8926 |
| 2000 | 0.1616 | 521.2028 | 3308.0386 | 20.2010 | 17.3798 | 57.538 | 7.192 | 2471.2515 |
| 3000 | 0.2424 | 433.2541 | 2722.2993 | 20.1000 | 17.3672 | 57.58 | 7.197 | 1283.4985 |
| 4000 | 0.3232 | 387.5081 | 2569.3728 | 20.0170 | 17.3651 | 57.587 | 7.198 | 1167.0867 |
| 5000 | 0.4040 | 332.2302 | 2197.1006 | 19.9310 | 17.283 | 57.86 | 7.233 | 1141.8051 |
| 6000 | 0.4848 | 292.5944 | 1835.8154 | 19.8590 | 17.2939 | 57.824 | 7.228 | 905.3102 |
| 7000 | 0.5657 | 266.3748 | 1648.5508 | 19.7820 | 17.3184 | 57.742 | 7.218 | 844.8045 |
| 8000 | 0.6465 | 244.8321 | 1513.9550 | 19.7310 | 17.3028 | 57.794 | 7.224 | 1150.9904 |
| 9000 | 0.7273 | 225.9773 | 1391.1320 | 19.6630 | 17.2806 | 57.868 | 7.234 | 821.2236 |
| 10000 | 0.8081 | 209.6788 | 1266.0754 | 19.6040 | 17.3446 | 57.655 | 7.207 | 718.9499 |
| 11000 | 0.8889 | 196.7588 | 1248.5234 | 19.5620 | 17.3611 | 57.6 | 7.2 | 611.5998 |
| 12000 | 0.9697 | 179.4194 | 1137.2484 | 19.5120 | 17.3767 | 57.548 | 7.194 | 572.3267 |
| 12375 | 1.0 | 175.7241 | 1080.9574 | 19.4920 | 17.3076 | 57.778 | 7.222 | 584.9987 |
### Framework versions
- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0