TransMLA: Multi-head Latent Attention Is All You Need
Abstract
Modern large language models (LLMs) often face communication bottlenecks on current hardware rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, allowing compressed latent KV states to be cached. This significantly reduces the KV cache size relative to traditional multi-head attention and leads to faster inference. Moreover, MLA employs an up-projection matrix to increase expressiveness, trading additional computation for reduced communication overhead. Although MLA has demonstrated efficiency and effectiveness in DeepSeek V2/V3/R1, many major model providers still rely on Grouped-Query Attention (GQA) and have not announced any plans to adopt MLA. In this paper, we show that GQA can always be represented by MLA with the same KV cache overhead, but the converse does not hold. To encourage broader use of MLA, we introduce **TransMLA**, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo additional training to boost expressiveness without increasing the KV cache size. Furthermore, we plan to develop MLA-specific inference acceleration techniques to preserve low latency in transformed models, thus enabling more efficient distillation of DeepSeek R1.
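To make the containment argument concrete (GQA can be rewritten as MLA with the same KV cache), here is a minimal, hypothetical PyTorch sketch. It is not the authors' code; all names and dimensions are invented for the example. It verifies numerically that a GQA key projection with replicated KV heads factors exactly into an MLA-style down-projection (whose output is what gets cached) and an up-projection.

```python
# Hypothetical sketch, not the authors' released code. A GQA key projection whose
# KV heads are replicated across query-head groups is a low-rank map, so it can be
# rewritten exactly as an MLA-style down-projection (cached latent) followed by an
# up-projection. All dimensions below are illustrative.
import torch

torch.manual_seed(0)
d_model, n_q_heads, n_kv_heads, head_dim = 512, 8, 2, 64
group = n_q_heads // n_kv_heads          # query heads sharing one KV head

# GQA key projection: one key head per KV group.
W_k_gqa = torch.randn(d_model, n_kv_heads * head_dim)
x = torch.randn(4, d_model)              # hidden states for 4 tokens

# GQA path: project, then replicate each KV head for its group of query heads.
k_gqa = (x @ W_k_gqa).view(-1, n_kv_heads, head_dim)
k_gqa = k_gqa.repeat_interleave(group, dim=1)          # (4, n_q_heads, head_dim)

# MLA-style rewrite: keep W_k_gqa as the down-projection (its output is the cached
# latent, the same size as the GQA KV cache) and build an up-projection that
# copies each latent head to its group of query heads.
W_down = W_k_gqa
W_up = torch.zeros(n_kv_heads * head_dim, n_q_heads * head_dim)
for q in range(n_q_heads):
    kv = q // group
    W_up[kv * head_dim:(kv + 1) * head_dim,
         q * head_dim:(q + 1) * head_dim] = torch.eye(head_dim)

latent = x @ W_down                      # the only tensor that needs caching
k_mla = (latent @ W_up).view(-1, n_q_heads, head_dim)

assert torch.allclose(k_gqa, k_mla, atol=1e-5)
print("GQA keys reproduced exactly from the cached MLA latent")
```

After such a conversion, the up-projection (and its analogue for values) can presumably be fine-tuned as a dense matrix, which is where extra expressiveness would come from without increasing the cache size.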
Community
- Provided theoretical proof and experimental validation that, for the same KV cache overhead, the expressive capacity of MLA surpasses that of GQA (see the short note after this list).
- Transformed GQA-based models such as LLaMA-3 and Qwen-2.5 into equivalent MLA models.
- Utilized the enhanced models to replicate R1 (to do).
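A compact way to state the containment in the first bullet, using illustrative notation rather than the paper's own ($X$: token hidden states; $W_K$: GQA key projection with $n_{kv}$ heads of dimension $d_h$; $g = n_q/n_{kv}$ query heads per KV head):

$$
\underbrace{X\,W_K\,\bigl(I_{n_{kv}} \otimes \mathbf{1}_{1\times g} \otimes I_{d_h}\bigr)}_{\text{GQA: each KV head replicated across its } g \text{ query heads}}
\;=\;
\underbrace{\bigl(X\,W^{DK}\bigr)}_{\text{cached latent}}\,W^{UK},
\qquad
W^{DK} := W_K,\quad
W^{UK} := I_{n_{kv}} \otimes \mathbf{1}_{1\times g} \otimes I_{d_h}.
$$

Caching $X W^{DK}$ costs exactly as much as GQA's KV cache, and MLA allows $W^{UK}$ to be an arbitrary trainable matrix of that shape rather than the fixed block-copy matrix above, so every GQA layer is an MLA layer with a constrained up-projection; a generic $W^{UK}$ cannot be written as such a replication, which is why the converse fails. The same construction applies to the value projection.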
It would have been nice to also see results when training the whole model instead of only the keys and values.
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, appear similar to this paper:
- Multi-matrix Factorization Attention (2024)
- Tensor Product Attention Is All You Need (2025)
- AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference (2025)
- Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models (2025)
- Deliberation in Latent Space via Differentiable Cache Augmentation (2024)
- DeepSeek-V3 Technical Report (2024)
- ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals (2024)
Dear authors,
Your paper TransMLA seems closely related to “Tensor Product Attention Is All You Need” (https://arxiv.org/abs/2501.06425), which unifies multiple attention mechanisms (MHA, MQA, GQA) as non-contextual instances of TPA. Given TransMLA’s thematic overlap and its title’s resemblance to the TPA work, readers would benefit from a direct comparison or citation. How does TransMLA fit into, or diverge from, TPA’s factorization framework? A brief discussion would clarify TransMLA’s novelty and strengthen the paper’s positioning. I kindly request that you cite TPA and highlight any key similarities or differences in a future revision.
Thank you.
In addition, may I ask whether your method is RoPE-compatible?