nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 • Text Generation
ZClip: Adaptive Spike Mitigation for LLM Pre-Training • Paper • 2504.02507
A Refined Analysis of Massive Activations in LLMs • Paper • 2503.22329
Variance Control via Weight Rescaling in LLM Pre-training • Paper • 2503.17500
The Ultra-Scale Playbook 🌌 • Space • The ultimate guide to training LLMs on large GPU clusters
Falcon 2: An 11B parameter pretrained language model and VLM, trained on over 5,000B tokens and 11 languages • Article • By Quent-01 and 9 others • May 24, 2024