# Comparisons of timm Optimizers w/ Caution

This repo contains summaries of several sets of experiments comparing a number of timm optimizers with and without caution (https://huggingface.co./papers/2411.16085) enabled.

The runs were all performed training a smaller ViT (`vit_wee_patch16_reg1_gap_256`) for 200 epochs (10M samples seen) from scratch on the timm 'mini-imagenet' dataset, a 100-class subset of ImageNet with the same image sizes as the originals.

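For reference, the dataset can be pulled straight from the Hub. A minimal sketch using the `datasets` library, assuming the default split names:

    from datasets import load_dataset

    # 100-class ImageNet subset used for these runs.
    ds = load_dataset('timm/mini-imagenet', split='train')
    print(len(ds), ds.features)
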
So far I have results for adamw, laprop, and mars (https://huggingface.co./papers/2411.10438). You can find full results in sub-folders named after each optimizer. In all of these runs, the experiments with a 'c' prefix in the name have caution enabled.

This is what the 'caution' addition looks like in an optimizer:

    # Mask out components of the update where the sign of the running
    # average (exp_avg) disagrees with the sign of the current gradient.
    mask = (exp_avg * grad > 0).to(grad.dtype)
    # Rescale by the fraction kept so the mean update magnitude is preserved.
    mask.div_(mask.mean().clamp_(min=1e-3))
    exp_avg = exp_avg * mask

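To show where that masking sits in a full update, here is a minimal sketch of a cautious Adam-style step in PyTorch. This is illustrative only, not the actual timm implementation; the function name `cautious_adam_step` and its signature are made up for this example:

    import torch

    def cautious_adam_step(param, grad, exp_avg, exp_avg_sq, step,
                           lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # Standard Adam first/second moment updates.
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

        bias_correction1 = 1 - beta1 ** step
        bias_correction2 = 1 - beta2 ** step

        # Caution: zero update components whose sign disagrees with the
        # current gradient, then rescale by the fraction that survives.
        mask = (exp_avg * grad > 0).to(grad.dtype)
        mask.div_(mask.mean().clamp_(min=1e-3))
        update = exp_avg * mask

        denom = (exp_avg_sq / bias_correction2).sqrt_().add_(eps)
        param.addcdiv_(update, denom, value=-lr / bias_correction1)

    # Example usage on a single parameter tensor:
    p, g = torch.randn(10), torch.randn(10)
    m, v = torch.zeros(10), torch.zeros(10)
    cautious_adam_step(p, g, m, v, step=1)

Note that the mask is applied to the update, not to the stored momentum state, matching the snippet above (which assigns the masked result to a local `exp_avg`).
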
Train args:

    ./distributed_train.sh 2 --dataset hfds/timm/mini-imagenet --num-classes 100 --model vit_wee_patch16_reg1_gap_256 -j 8 --epochs 200 --warmup-prefix --sched-on-updates --warmup-lr 0 --mixup .2 --model-ema --model-ema-decay 0.999 --model-ema-warmup --aa rand-m9-mstd0.5-inc1 --remode pixel --reprob 0.25 --amp --weight-decay .05 --drop 0.1 --drop-path .1 -b 288 --opt cadamw --lr 1e-3

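The cautious variants can be selected by name just like any other timm optimizer. A minimal sketch, assuming a timm version recent enough to register the 'c'-prefixed optimizers used in these experiments:

    import timm
    from timm.optim import create_optimizer_v2

    # Same model / optimizer combination as the cadamw run above.
    model = timm.create_model('vit_wee_patch16_reg1_gap_256', num_classes=100)
    optimizer = create_optimizer_v2(model, opt='cadamw', lr=1e-3, weight_decay=0.05)
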
## LaProp

| optim | best_epoch | train_loss | eval_loss | eval_top1 | eval_top5 | lr |
|:--|--:|--:|--:|--:|--:|--:|
| claprop, lr=1e-03 | 204 | 2.2174 | 1.0932 | 73.92 | 91.33 | 0.0 |
| claprop, lr=5e-04 | 183 | 2.2622 | 1.0913 | 73.77 | 91.22 | 1.35e-05 |
| laprop, lr=5e-04 | 198 | 2.2426 | 1.1426 | 71.73 | 90.55 | 1.11e-06 |
| laprop, lr=1e-03 | 179 | 2.2900 | 1.1684 | 71.15 | 90.18 | 3.81e-05 |
| claprop, lr=2e-04 | 195 | 2.5462 | 1.2475 | 68.30 | 89.15 | 9.98e-07 |
| laprop, lr=2e-04 | 204 | 2.6702 | 1.3092 | 67.08 | 88.67 | 0.0 |
| claprop, lr=2e-03 | 193 | 2.6781 | 1.5240 | 62.08 | 84.80 | 1.49e-05 |
| laprop, lr=2e-03 | 200 | 2.7047 | 1.5229 | 61.46 | 85.28 | 1.97e-06 |

*(figure: LaProp Top-1 Evaluation Accuracy on Mini-ImageNet)*

*(figure: LaProp Train Loss)*

## AdamW

| optim | best_epoch | train_loss | eval_loss | eval_top1 | eval_top5 |
|:--|--:|--:|--:|--:|--:|
| cadamw, lr=1e-03 | 184 | 2.2689 | 1.0868 | 73.52 | 91.60 |
| cadamw, lr=5e-04 | 199 | 2.1633 | 1.0976 | 73.39 | 91.31 |
| cadamw, lr=1e-03, clip grads | 203 | 2.1361 | 1.1043 | 73.33 | 91.41 |
| adamw, lr=1e-03, clip grads | 195 | 2.2746 | 1.1430 | 72.11 | 90.47 |
| adamw, lr=5e-04 | 185 | 2.3040 | 1.1536 | 71.50 | 90.48 |
| adamw, lr=1e-03 | 199 | 2.2237 | 1.1658 | 71.23 | 90.31 |
| cadamw, lr=2e-04 | 189 | 2.5386 | 1.2326 | 68.95 | 89.61 |
| adamw, lr=2e-04 | 203 | 2.5796 | 1.3086 | 67.11 | 88.66 |

*(figure: AdamW Top-1 Evaluation Accuracy on Mini-ImageNet)*

*(figure: AdamW Train Loss)*

## MARS

| optim | best_epoch | train_loss | eval_loss | eval_top1 | eval_top5 |
|:--|--:|--:|--:|--:|--:|
| cmars, lr=1e-03 | 198 | 2.0548 | 1.0436 | 74.91 | 92.08 |
| cmars, lr=2e-03 | 203 | 2.0272 | 1.0706 | 74.31 | 91.54 |
| mars, lr=1e-03 | 184 | 2.2198 | 1.0722 | 74.06 | 91.62 |
| mars, lr=2e-03 | 197 | 2.1454 | 1.0964 | 73.73 | 91.15 |
| cmars, lr=5e-04 | 198 | 2.2019 | 1.0836 | 73.32 | 91.67 |
| mars, lr=5e-04 | 189 | 2.3228 | 1.1200 | 72.03 | 90.86 |

*(figure: MARS Top-1 Evaluation Accuracy on Mini-ImageNet)*

*(figure: MARS Train Loss)*


Dataset used for these runs: timm/mini-imagenet (https://huggingface.co./datasets/timm/mini-imagenet)