distily_bench_obj_cross_v2.4

This student model was distilled from the teacher model roneneldan/TinyStories-33M. The training dataset is not specified.

The Distily library was used for this distillation.
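A minimal loading sketch follows, assuming the checkpoint is published on the Hugging Face Hub under the repository ID lapp0/distily_bench_obj_cross_v2.4 (the owner and model name given on this card) and that the repository ships both causal-LM weights and tokenizer files. The helper names `load_student` and `generate_sample` are hypothetical and only used for illustration.

```python
MODEL_ID = "lapp0/distily_bench_obj_cross_v2.4"  # owner/name as given on this card


def load_student(model_id: str = MODEL_ID):
    """Load the student model and tokenizer (hypothetical helper).

    Assumes the repository contains causal-LM weights and tokenizer files.
    """
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return model, tokenizer


def generate_sample(prompt: str = "Once upon a time", max_new_tokens: int = 20) -> str:
    """Generate a short continuation of the prompt (hypothetical helper)."""
    model, tokenizer = load_student()
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Given the perplexities reported on this card, generated text from this checkpoint should not be expected to be coherent.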

It achieves the following results on the evaluation set:

eval_enwikippl: 49928.3438
eval_frwikippl: 60082.1211
eval_zhwikippl: 75499.0547
eval_tinystoriesppl: 44922.0352
eval_loss: 6.1235
eval_runtime: 13.0526
eval_samples_per_second: 76.613
eval_steps_per_second: 9.577
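The throughput figures above can be cross-checked with a little arithmetic. The sketch below assumes eval_runtime is in seconds and that the three throughput values describe the same evaluation pass; under those assumptions it estimates how many samples and steps one evaluation covers.

```python
# Final-evaluation figures reported above.
eval_runtime = 13.0526             # assumed to be seconds
eval_samples_per_second = 76.613
eval_steps_per_second = 9.577
eval_batch_size = 8                # from the hyperparameter list

# runtime * rate gives an estimate of the total count.
approx_samples = round(eval_runtime * eval_samples_per_second)
approx_steps = round(eval_runtime * eval_steps_per_second)

print(approx_samples)                  # 1000
print(approx_steps)                    # 125
print(approx_steps * eval_batch_size)  # 1000
```

The two estimates agree: about 125 evaluation steps at batch size 8 correspond to about 1000 evaluation samples.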

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
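The objective above puts weight 1 on a KL loss over logits and weight 0 on the hidden-state (hs) and attention (attn) components, so only the logits term contributes. The following is a framework-free sketch of such a loss, not Distily's implementation: the KL direction (teacher relative to student), the absence of a temperature, and the mean over positions are all assumptions, and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one token position (assumed direction)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sequence_loss(teacher_seq, student_seq, logits_weight=1.0):
    """Weighted mean KL over all positions of one sequence.

    The hs and attn components have weight 0 in the objective above,
    so they are omitted here.
    """
    per_position = [
        kl_logits_loss(t, s) for t, s in zip(teacher_seq, student_seq)
    ]
    return logits_weight * sum(per_position) / len(per_position)


# Toy example: two positions, vocabulary of three tokens.
teacher = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
student = [[1.5, 0.7, -0.8], [0.1, -0.1, 0.0]]
print(sequence_loss(teacher, student))
```

The loss is zero when the student's logits produce the same distribution as the teacher's and positive otherwise.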
train_embeddings: True
learning_rate: 0.0001
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 1.0
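To make the linear schedule concrete, the sketch below computes the learning rate at a given step. It assumes zero warmup steps (the card does not state a warmup setting) and takes the total of 12375 optimizer steps from the final row of the evaluation table. The function name `linear_lr` is hypothetical.

```python
BASE_LR = 1e-4        # learning_rate from the list above
TOTAL_STEPS = 12375   # final step in the evaluation table
TRAIN_BATCH_SIZE = 8  # train_batch_size from the list above


def linear_lr(step, total_steps=TOTAL_STEPS, base_lr=BASE_LR, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 500, 6000, 12375):
    print(step, linear_lr(step))

# Sequences seen in one epoch, assuming a single device and
# no gradient accumulation (neither is stated on the card).
print(TOTAL_STEPS * TRAIN_BATCH_SIZE)  # 99000
```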

Resource Usage

Peak GPU Memory: 8.0568 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		169.9865	47377.9414					3.9789	4998.1294
0	0	77152.8516	72247.2109	6.5107	12.9698	77.102	9.638	77917.4219	77892.9766
500	0.0404	49943.8203	60082.1211	6.1230	12.9612	77.153	9.644	44951.7734	75499.0547
1000	0.0808	49943.8203	60082.1211	6.1230	12.975	77.071	9.634	44951.7734	75499.0547
1500	0.1212	49943.8203	60082.1211	6.1230	12.9603	77.159	9.645	44944.3164	75499.0547
2000	0.1616	49943.8203	60082.1211	6.1235	12.9533	77.2	9.65	44922.0352	75499.0547
2500	0.2020	49943.8203	60082.1211	6.1233	12.9601	77.16	9.645	44922.0352	75499.0547
3000	0.2424	49920.5820	60082.1211	6.1233	12.9557	77.186	9.648	44907.2148	75499.0547
3500	0.2828	49920.5820	60082.1211	6.1233	12.9662	77.124	9.64	44907.2148	75499.0547
4000	0.3232	49920.5820	60082.1211	6.1233	12.9824	77.027	9.628	44907.2148	75499.0547
4500	0.3636	49920.5820	60082.1211	6.1233	12.9965	76.944	9.618	44907.2148	75499.0547
5000	0.4040	49928.3438	60082.1211	6.1235	13.1112	76.271	9.534	44922.0352	75499.0547
5500	0.4444	49928.3438	60082.1211	6.1235	13.1865	75.835	9.479	44922.0352	75499.0547
6000	0.4848	49928.3438	60082.1211	6.1235	13.0376	76.701	9.588	44922.0352	75499.0547
6500	0.5253	49928.3438	60082.1211	6.1235	12.9934	76.962	9.62	44922.0352	75499.0547
7000	0.5657	49928.3438	60082.1211	6.1235	12.9741	77.077	9.635	44922.0352	75499.0547
7500	0.6061	49928.3438	60082.1211	6.1235	13.0011	76.917	9.615	44922.0352	75499.0547
8000	0.6465	49928.3438	60082.1211	6.1235	13.021	76.799	9.6	44922.0352	75499.0547
8500	0.6869	49928.3438	60082.1211	6.1235	13.023	76.787	9.598	44922.0352	75499.0547
9000	0.7273	49928.3438	60082.1211	6.1235	12.9717	77.091	9.636	44922.0352	75499.0547
9500	0.7677	49928.3438	60082.1211	6.1235	13.0526	76.613	9.577	44922.0352	75499.0547
10000	0.8081	49928.3438	60082.1211	6.1235	12.9964	76.944	9.618	44922.0352	75499.0547
10500	0.8485	49928.3438	60082.1211	6.1235	12.9662	77.123	9.64	44922.0352	75499.0547
11000	0.8889	49928.3438	60082.1211	6.1235	12.9957	76.949	9.619	44922.0352	75499.0547
11500	0.9293	49928.3438	60082.1211	6.1235	12.9518	77.209	9.651	44922.0352	75499.0547
12000	0.9697	49928.3438	60082.1211	6.1235	12.9583	77.171	9.646	44922.0352	75499.0547
12375	1.0	49928.3438	60082.1211	6.1235	13.0092	76.869	9.609	44922.0352	75499.0547
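The table is easier to interpret as ratios. The sketch below copies the teacher row, the step 0 row, and the final row from the table and reports, per dataset, how far the student's perplexity is from the teacher's and how much it fell during training. The function names are hypothetical.

```python
# Perplexities copied from the table above.
teacher = {"enwikippl": 169.9865, "frwikippl": 47377.9414,
           "tinystoriesppl": 3.9789, "zhwikippl": 4998.1294}
student_step0 = {"enwikippl": 77152.8516, "frwikippl": 72247.2109,
                 "tinystoriesppl": 77917.4219, "zhwikippl": 77892.9766}
student_final = {"enwikippl": 49928.3438, "frwikippl": 60082.1211,
                 "tinystoriesppl": 44922.0352, "zhwikippl": 75499.0547}


def gap_to_teacher(student, reference):
    """Student perplexity divided by teacher perplexity, per dataset."""
    return {name: student[name] / reference[name] for name in reference}


def relative_drop(start, end):
    """Fractional perplexity reduction between two checkpoints."""
    return {name: 1 - end[name] / start[name] for name in start}


gaps = gap_to_teacher(student_final, teacher)
drops = relative_drop(student_step0, student_final)
for name in teacher:
    print(f"{name}: {gaps[name]:.1f}x teacher, "
          f"{drops[name]:.1%} below step 0")
```

In the table, nearly all of the change happens between step 0 and step 500; the values are essentially flat afterwards.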

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.21.0
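To compare a local environment against the versions listed above, a small check with the standard library can help. This is a sketch: the distribution names used here (in particular "distily" and "torch") are assumptions about how the packages are published, and `installed_versions` is a hypothetical helper.

```python
from importlib import metadata

# Versions listed on this card; distribution names are assumed.
EXPECTED = {
    "distily": "0.2.0",
    "transformers": "4.44.0",
    "torch": "2.3.0",
    "datasets": "2.21.0",
}


def installed_versions(names=EXPECTED):
    """Return {name: installed version, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


for name, version in installed_versions().items():
    print(f"{name}: installed={version}, card={EXPECTED[name]}")
```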

lapp0/distily_bench_obj_cross_v2.4