Cautious Optimizers: Improving Training with One Line of Code
Abstract
AdamW has been the default optimizer for transformer pretraining. For many years, our community has searched for faster and more stable optimizers, with only constrained positive outcomes. In this work, we propose a single-line modification in PyTorch to any momentum-based optimizer, which we rename the Cautious Optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and does not break the convergence guarantee under Lyapunov analysis. In addition, our theoretical insight reveals a whole new family of optimizers. Among them, we pick the simplest one for empirical experiments, achieving a speed-up of up to 1.47× on Llama and MAE pretraining. Code is available at https://github.com/kyleliang919/C-Optim
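The single-line change the abstract refers to can be sketched as a masking step applied to any momentum-style update: coordinates whose sign disagrees with the current gradient are zeroed out, and the rest are rescaled. The snippet below is a minimal illustration of that idea, not the authors' implementation; the helper name `cautious_mask`, the clamp constant, and the toy momentum loop are assumptions made for the example (see the linked repository for the actual code).

```python
import torch

def cautious_mask(update: torch.Tensor, grad: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    # Keep only the coordinates where the proposed update and the current
    # gradient agree in sign; zero out the rest.
    mask = (update * grad > 0).to(update.dtype)
    # Rescale the surviving coordinates so the overall update scale is roughly
    # preserved. (The clamp constant is an illustrative choice.)
    return update * mask / mask.mean().clamp(min=eps)


# Toy usage with a plain momentum buffer (SGD with momentum):
w = torch.randn(10, requires_grad=True)
momentum = torch.zeros_like(w)
lr, beta = 0.1, 0.9

for _ in range(5):
    loss = (w ** 2).sum()                      # dummy objective
    grad = torch.autograd.grad(loss, w)[0]
    momentum = beta * momentum + grad
    with torch.no_grad():
        w -= lr * cautious_mask(momentum, grad)
```

The same masking line can be dropped into an existing Adam/AdamW step right before the parameter update, which is what makes it a one-line modification to momentum-based optimizers.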
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- A second-order-like optimizer with adaptive gradient scaling for deep learning (2024)
- Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning (2024)
- Adam Exploits ℓ∞-geometry of Loss Landscape via Coordinate-wise Adaptivity (2024)
- ADOPT: Modified Adam Can Converge with Any $\beta_2$ with the Optimal Rate (2024)
- FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training (2024)
- MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts (2024)
- Old Optimizer, New Norm: An Anthology (2024)