arxiv:2411.17691

Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens

Published on Nov 26 · Submitted by sggetao on Nov 27
Authors: Tao Ge et al.

Abstract

We reveal that low-bit quantization favors undertrained large language models (LLMs) by observing that models with larger sizes or fewer training tokens experience less quantization-induced degradation (QiD) when applying low-bit quantization, whereas smaller models with extensive training tokens suffer significant QiD. To gain deeper insights into this trend, we study over 1500 quantized LLM checkpoints of various sizes and at different training levels (undertrained or fully trained) in a controlled setting, deriving scaling laws for understanding the relationship between QiD and factors such as the number of training tokens, model size and bit width. With the derived scaling laws, we propose a novel perspective that we can use QiD to measure an LLM's training levels and determine the number of training tokens required for fully training LLMs of various sizes. Moreover, we use the scaling laws to predict the quantization performance of different-sized LLMs trained with 100 trillion tokens. Our projection shows that the low-bit quantization performance of future models, which are expected to be trained with over 100 trillion tokens, may NOT be desirable. This poses a potential challenge for low-bit quantization in the future and highlights the need for awareness of a model's training level when evaluating low-bit quantization research. To facilitate future research on this problem, we release all the 1500+ quantized checkpoints used in this work at https://huggingface.co./Xu-Ouyang.
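To make the shape of the scaling laws described in the abstract concrete, here is a minimal sketch assuming a simple power-law relationship between QiD, training tokens D, model size N, and bit width P. The functional form and every constant (`k`, `alpha`, `beta`, `gamma`) are illustrative assumptions, not the fitted values reported in the paper.

```python
# Hedged sketch: one plausible power-law form for quantization-induced degradation
# (QiD), i.e. the loss increase after low-bit quantization, as a function of
# training tokens D, model size N, and bit width P. All constants are
# ILLUSTRATIVE placeholders, not the paper's fitted values.

def predicted_qid(num_tokens: float, model_size: float, bit_width: float,
                  k: float = 0.1, alpha: float = 0.5, beta: float = 0.5,
                  gamma: float = 5.0) -> float:
    """Assumed form: QiD ~= k * D**alpha / (N**beta * P**gamma).

    More training tokens raise QiD, while a larger model size or a higher bit
    width lowers it, matching the qualitative trend described in the abstract.
    """
    return k * num_tokens ** alpha / (model_size ** beta * bit_width ** gamma)

# Example: a hypothetical 7B model under 2-bit quantization, trained on
# 2T vs. 100T tokens (numbers are illustrative only).
for tokens in (2e12, 100e12):
    print(f"D = {tokens:.0e} tokens -> predicted QiD = {predicted_qid(tokens, 7e9, 2):.4f}")
```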

Community

Paper author · Paper submitter:

Takeaways:

  1. We find that low-bit quantization favors undertrained LLMs, i.e., models that are either large or trained on a small number of tokens; for fully trained LLMs, it causes severe quantization-induced degradation (QiD) (Figure 2).
  2. We derive scaling laws that predict the QiD of a given LLM under low-bit quantization from its model size, number of training tokens, and bit width (Section 3.5).
  3. We use QiD to gauge whether an LLM is fully trained, estimating with the derived scaling laws that a 70B model requires 17 trillion training tokens to be relatively fully trained, while a 405B model needs nearly 50 trillion (a rough numerical sketch of this inversion follows this list).
  4. We use the derived scaling laws to predict the quantization-induced degradation of 7B, 70B, and 405B models trained with 100 trillion tokens when low-bit quantization is applied.
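As referenced in takeaway 3, the snippet below inverts the assumed power-law form from the sketch above to estimate the training-token count at which QiD would reach a chosen threshold for a given model size and bit width. The threshold and all constants are placeholders, not the paper's fitted values, so the printed numbers only illustrate the mechanics of the inversion.

```python
# Hedged sketch: invert the assumed form QiD ~= k * D**alpha / (N**beta * P**gamma)
# to estimate the training-token count D at which QiD reaches a chosen threshold.
# All constants (k, alpha, beta, gamma, and the threshold) are illustrative
# placeholders, not the paper's fitted values.

def tokens_at_qid_threshold(qid_threshold: float, model_size: float, bit_width: float,
                            k: float = 0.1, alpha: float = 0.5, beta: float = 0.5,
                            gamma: float = 5.0) -> float:
    """Solve qid_threshold = k * D**alpha / (N**beta * P**gamma) for D:
    D = (qid_threshold * N**beta * P**gamma / k) ** (1 / alpha)."""
    return (qid_threshold * model_size ** beta * bit_width ** gamma / k) ** (1.0 / alpha)

# Example (illustrative numbers only): token counts at which 70B and 405B models
# would reach the same QiD threshold under 2-bit quantization.
for n_params in (70e9, 405e9):
    tokens = tokens_at_qid_threshold(0.2, model_size=n_params, bit_width=2)
    print(f"N = {n_params:.0e} params -> D ~= {tokens:.2e} tokens")
```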
Paper author · Paper submitter:

[Attached image: alternative rendering of Figure 1.] See this figure in case Figure 1 is not displayed properly (missing gray areas) in your Chrome/Edge browser.

Nice work!


