Abstract
Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of R^{m×n}, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.
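To make the steepest-descent reading concrete, here is a minimal sketch (not from the paper; the function names, the plain-PyTorch setup, and the learning-rate value are illustrative) of two such updates: sign descent, the steepest-descent direction under the elementwise max-norm that Adam reduces to once its exponential moving averages are switched off, and the orthogonalized-gradient step, the steepest-descent direction under the spectral norm that Shampoo reduces to once its accumulators are switched off.

```python
import torch

def sign_descent_step(weight, grad, lr):
    # Adam with its exponential moving averages switched off (beta1 = beta2 = 0)
    # reduces to sign descent: the steepest-descent direction under the
    # elementwise max-norm (infinity norm) on the parameter tensor.
    return weight - lr * grad.sign()

def spectral_descent_step(weight, grad, lr):
    # Shampoo with its accumulators switched off applies the orthogonalized
    # gradient U V^T (from the reduced SVD G = U S V^T), which is the
    # steepest-descent direction under the spectral norm on the weight matrix.
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    return weight - lr * (U @ Vh)

# Example: one step on a randomly initialized matrix-shaped parameter.
W = torch.randn(128, 64)
G = torch.randn(128, 64)  # stand-in for a gradient
W_new = spectral_descent_step(W, G, lr=0.02)
```

Choosing which norm goes with which tensor is the design space the abstract points to: the same R^{m×n} weight could be stepped under the spectral norm when it acts as a linear map, or under a different operator norm when it stores embedding rows.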
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement (2024)
- SOAP: Improving and Stabilizing Shampoo using Adam (2024)
- Narrowing the Focus: Learned Optimizers for Pretrained Models (2024)
- WarpAdam: A new Adam optimizer based on Meta-Learning approach (2024)
- Memory-Efficient LLM Training with Online Subspace Descent (2024)