L^2M: Mutual Information Scaling Law for Long-Context Language Modeling
Abstract
We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies. This scaling law, which we show is distinct from and scales independently of the conventional two-point mutual information, is the key to understanding long-context language modeling. Using this scaling law, we formulate the Long-context Language Modeling (L^2M) condition, which relates a model's capacity for effective modeling of long context lengths to the scaling of its latent state size for storing past information. Our results are validated through experiments on both transformers and state space models. This work establishes a theoretical foundation that guides the development of large language models toward longer context lengths.
Community
This paper establishes a fundamental bipartite mutual information scaling law in natural language that follows power-law growth (L^β). The authors show this scaling is distinct from that of conventional two-point mutual information and is the key to understanding long-context language modeling. Based on this insight, they formulate the Long-context Language Modeling (L²M) condition, which relates a model's ability to handle long contexts to how its history state dimensions must scale. Their empirical validation confirms the theoretical predictions across different architectures, demonstrating how the scaling behavior of history states affects performance on long-range dependencies. These findings provide a theoretical foundation for understanding long-range dependencies in language models and guiding architecture development. Code is available at https://github.com/LSquaredM/mutual_info_scaling_law.
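As a rough formalization of the two claims above (the notation here is ours and may differ from the paper's): let X_{1:L} and X_{L+1:2L} denote two adjacent blocks of L tokens. The bipartite mutual information is the quantity claimed to grow as a power law,

\[ I\bigl(X_{1:L};\, X_{L+1:2L}\bigr) \;\propto\; L^{\beta}, \qquad \beta > 0, \]

which is distinct from the conventional two-point mutual information I(x_i; x_{i+d}) between individual tokens a distance d apart. Under this reading, the L²M condition says that a model compressing its past into a history state z_L can capture these dependencies only if the capacity of z_L (e.g., its dimension) grows at least as fast:

\[ \mathrm{size}(z_L) \;\gtrsim\; L^{\beta}. \]

Read this way, a fixed-size recurrent state would eventually fall short of the condition as L grows, whereas a history representation that grows with L need not; this is the sense in which the condition can guide architecture choices for longer contexts.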
The following similar papers were recommended by the Semantic Scholar API:
- Explaining Context Length Scaling and Bounds for Language Models (2025)
- LongAttn: Selecting Long-context Training Data via Token-level Attention (2025)
- LCIRC: A Recurrent Compression Approach for Efficient Long-form Context and Query Dependent Modeling in LLMs (2025)
- LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs (2025)
- Beyond In-Distribution Success: Scaling Curves of CoT Granularity for Language Model Generalization (2025)
- Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling (2025)
- Large Language Diffusion Models (2025)