L^2M: Mutual Information Scaling Law for Long-Context Language Modeling
Abstract
We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies. This scaling law, which we show is distinct from and scales independently of the conventional two-point mutual information, is the key to understanding long-context language modeling. Using this scaling law, we formulate the Long-context Language Modeling (L^2M) condition, which relates a model's capacity for effective modeling of long context lengths to the scaling of its latent state size for storing past information. Our results are validated through experiments on both transformers and state space models. This work establishes a theoretical foundation that guides the development of large language models toward longer context lengths.
Community
This paper establishes a fundamental bipartite mutual information scaling law in natural language that follows power-law growth (L^β). The authors show this scaling is distinct from that of conventional two-point mutual information and is the key to understanding long-context language modeling. Based on this insight, they formulate the Long-context Language Modeling (L²M) condition, which relates a model's ability to handle long contexts to how its history state dimensions must scale. Their empirical validation confirms the theoretical predictions across different architectures, demonstrating how the scaling behavior of history states affects performance on long-range dependencies. These findings provide a theoretical foundation for understanding long-range dependencies in language models and guiding architecture development. Code is available at https://github.com/LSquaredM/mutual_info_scaling_law.
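As a rough formalization of the two claims above (the notation here is ours and may differ from the paper's): let X_{1:L} and X_{L+1:2L} denote two adjacent blocks of L tokens. The bipartite mutual information is the quantity claimed to grow as a power law,

\[ I\bigl(X_{1:L};\, X_{L+1:2L}\bigr) \;\propto\; L^{\beta}, \qquad \beta > 0, \]

which is distinct from the conventional two-point mutual information I(x_i; x_{i+d}) between individual tokens a distance d apart. Under this reading, the L²M condition says that a model compressing its past into a history state z_L can capture these dependencies only if the capacity of z_L (e.g., its dimension) grows at least as fast:

\[ \mathrm{size}(z_L) \;\gtrsim\; L^{\beta}. \]

Read this way, a fixed-size recurrent state would eventually fall short of the condition as L grows, whereas a history representation that grows with L need not; this is the sense in which the condition can guide architecture choices for longer contexts.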
The following similar papers were recommended by the Semantic Scholar API:
- Explaining Context Length Scaling and Bounds for Language Models (2025)
- LongAttn: Selecting Long-context Training Data via Token-level Attention (2025)
- LCIRC: A Recurrent Compression Approach for Efficient Long-form Context and Query Dependent Modeling in LLMs (2025)
- LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs (2025)
- Beyond In-Distribution Success: Scaling Curves of CoT Granularity for Language Model Generalization (2025)
- Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling (2025)
- Large Language Diffusion Models (2025)