arxiv:2412.01505

Scaling Law for Language Models Training Considering Batch Size

Published on Dec 2, 2024

Authors:

Abstract

Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch size, influences the LLM training process. We begin by training language models ranging from 125 million to 2.6 billion parameters, using up to 300 billion high-quality tokens. Through these experiments, we establish a basic scaling law on model size and training data amount. We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models. Our analysis yields batch size scaling laws under two different cases: with a fixed compute budget, and with a fixed amount of training data. Extrapolation experiments on models of increasing sizes validate our predicted laws, which provide guidance for optimizing LLM training strategies under specific resource constraints.
