LLM Pre-Train
Scaling Laws for Neural Language Models
Paper • 2001.08361 • Published • 7
Note: OpenAI 2020 Scaling Law-1
Scaling Laws for Autoregressive Generative Modeling
Paper • 2010.14701 • Published
Note: OpenAI 2020 Scaling Law-2
Training Compute-Optimal Large Language Models
Paper • 2203.15556 • Published • 10
Note: DeepMind 2022 Scaling Law-3. See also "The Bitter Lesson" by Rich Sutton (2019): http://www.incompleteideas.net/IncIdeas/BitterLesson.html
A Survey on Data Selection for Language Models
Paper • 2402.16827 • Published • 4
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Paper • 2305.13169 • Published • 3
Understanding Emergent Abilities of Language Models from the Loss Perspective
Paper • 2403.15796 • Published
Note: Zhipu
Phi-4 Technical Report
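Paper • 2412.08905 • Published • 106
Note: Microsoft phi-4 (14B). https://mp.weixin.qq.com/s/zFDvFrR1wtz5ZpdAk1mT7w
1. Pivotal Token Search (PTS). PTS is one of the main innovations in Phi-4's training process.
   Principle: identify the pivotal tokens that have a large effect on answer correctness during generation, then optimize the model's predictions on those tokens specifically.
   Advantages:
   - Training efficiency: optimization is focused on the parts that influence the outcome most.
   - Model performance: the model is more likely to make the right choice at key decision points, which improves overall output quality.
2. Improved Direct Preference Optimization (DPO). DPO optimizes directly on preference data so that the model's outputs better match human preferences.
   Innovations:
   - Combination with PTS: training pairs generated by PTS are used in DPO to improve the optimization.
   - Evaluation: the model is evaluated on its behavior at pivotal tokens, which measures the effect of the optimization more precisely.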
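The note above describes PTS and DPO only in words, so here is a minimal sketch of both ideas. It is not the procedure from the Phi-4 report. The scan below is a simplified linear pass, and `success_prob` is a hypothetical callback (for example, the fraction of sampled completions from a prefix that reach the correct answer). The names `find_pivotal_tokens`, `dpo_loss`, `threshold`, and `beta` are also assumptions introduced for illustration. The loss is the standard per-pair DPO objective computed from sequence log-probabilities.

```python
import math
from typing import Callable, List, Sequence, Tuple


def find_pivotal_tokens(
    tokens: Sequence[str],
    success_prob: Callable[[Sequence[str]], float],
    threshold: float = 0.2,
) -> List[Tuple[int, str, float]]:
    """Return (index, token, delta) for each token whose addition to the
    prefix changes the estimated success probability by at least `threshold`.

    `success_prob` is a hypothetical estimator supplied by the caller.
    """
    pivotal = []
    prev = success_prob(tokens[:0])  # probability before any token is emitted
    for i, tok in enumerate(tokens):
        cur = success_prob(tokens[: i + 1])
        delta = cur - prev
        if abs(delta) >= threshold:
            pivotal.append((i, tok, delta))
        prev = cur
    return pivotal


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin)))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In this sketch, a token with a large positive or negative `delta` would give an accepted or rejected continuation at that position, and such pairs could then be scored with `dpo_loss`. How the pairs are actually built and weighted in Phi-4 is not specified in the note.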
Qwen2.5 Technical Report
Paper • 2412.15115 • Published • 343
Movie Gen: A Cast of Media Foundation Models
Paper • 2410.13720 • Published • 91
Note: Meta, multimodal
Measuring the Effects of Data Parallelism on Neural Network Training
Paper • 1811.03600 • Published • 2
Note: Google. Covers the relationships among batch size, learning rate, number of training steps, and related quantities.
Mixture-of-Agents Enhances Large Language Model Capabilities
Paper • 2406.04692 • Published • 56
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Paper • 2405.18392 • Published • 12
Note: Hugging Face
Resolving Discrepancies in Compute-Optimal Scaling of Language Models
Paper • 2406.19146 • Published
Note: LAION. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Paper • 2401.02954 • Published • 42
Language models scale reliably with over-training and on downstream tasks
Paper • 2403.08540 • Published • 15
Getting the most out of your tokenizer for pre-training and domain adaptation
Paper • 2402.01035 • Published • 2