Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning Paper • 2402.17457 • Published Feb 27, 2024
Curvature-Informed SGD via General Purpose Lie-Group Preconditioners Paper • 2402.04553 • Published Feb 7, 2024
Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling Paper • 2405.14578 • Published May 23, 2024
Fast Benchmarking of Accuracy vs. Training Time with Cyclic Learning Rates Paper • 2206.00832 • Published Jun 2, 2022
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective Paper • 2410.23743 • Published Oct 31, 2024
ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models Paper • 2410.09637 • Published Oct 12, 2024
nGPT: Normalized Transformer with Representation Learning on the Hypersphere Paper • 2410.01131 • Published Oct 1, 2024
Cautious Optimizers: Improving Training with One Line of Code Paper • 2411.16085 • Published Nov 2024
MARS: Unleashing the Power of Variance Reduction for Training Large Models Paper • 2411.10438 • Published Nov 15, 2024
Understanding Gradient Descent through the Training Jacobian Paper • 2412.07003 • Published Dec 2024