Abstract
Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of R^{m×n}, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.
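To make the steepest-descent reading concrete, here is a minimal sketch (not from the paper; the function names, the plain-PyTorch setup, and the learning-rate value are illustrative) of two such updates: sign descent, the steepest-descent direction under the elementwise max-norm that Adam reduces to once its exponential moving averages are switched off, and the orthogonalized-gradient step, the steepest-descent direction under the spectral norm that Shampoo reduces to once its accumulators are switched off.

```python
import torch

def sign_descent_step(weight, grad, lr):
    # Adam with its exponential moving averages switched off (beta1 = beta2 = 0)
    # reduces to sign descent: the steepest-descent direction under the
    # elementwise max-norm (infinity norm) on the parameter tensor.
    return weight - lr * grad.sign()

def spectral_descent_step(weight, grad, lr):
    # Shampoo with its accumulators switched off applies the orthogonalized
    # gradient U V^T (from the reduced SVD G = U S V^T), which is the
    # steepest-descent direction under the spectral norm on the weight matrix.
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    return weight - lr * (U @ Vh)

# Example: one step on a randomly initialized matrix-shaped parameter.
W = torch.randn(128, 64)
G = torch.randn(128, 64)  # stand-in for a gradient
W_new = spectral_descent_step(W, G, lr=0.02)
```

Choosing which norm goes with which tensor is the design space the abstract points to: the same R^{m×n} weight could be stepped under the spectral norm when it acts as a linear map, or under a different operator norm when it stores embedding rows.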
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement (2024)
- SOAP: Improving and Stabilizing Shampoo using Adam (2024)
- Narrowing the Focus: Learned Optimizers for Pretrained Models (2024)
- WarpAdam: A new Adam optimizer based on Meta-Learning approach (2024)
- Memory-Efficient LLM Training with Online Subspace Descent (2024)