Abstract
When predicting the next token in a sequence, vanilla transformers compute attention over all previous tokens, so compute scales quadratically with sequence length. State-space models compress the entire sequence of tokens into a fixed-dimensional representation to improve efficiency, while other architectures achieve sub-quadratic complexity via low-rank projections or sparse attention patterns over the sequence. In this paper, we introduce Attamba, a novel architecture that uses state-space models to compress chunks of tokens and applies attention on these compressed key-value representations. We find that replacing the key and value projections in a transformer with SSMs can improve model quality and enable flexible token chunking, yielding 24% better perplexity than a transformer with a similar KV-Cache and attention footprint, and roughly 4× smaller KV-Cache and attention FLOPs at a ~5% perplexity trade-off. Attamba can perform attention over chunked sequences of variable length, enabling a smooth transition between quadratic and linear scaling and offering adaptable efficiency gains.
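To make the core idea concrete, below is a minimal sketch of attention over chunk-compressed keys and values, as the abstract describes. It is not the paper's implementation: the module names (`SimpleSSM`, `AttambaAttention`), the `chunk_size` parameter, and the toy diagonal recurrence standing in for the actual state-space model are all illustrative assumptions.

```python
# Hypothetical sketch of the Attamba idea: keys and values are produced by an
# SSM that compresses each chunk of tokens into one state vector, and attention
# is computed over these compressed per-chunk representations. Names and the
# toy recurrence are assumptions for illustration, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSSM(nn.Module):
    """Toy diagonal state-space recurrence: compresses a chunk of tokens
    into a single state vector (stand-in for the SSM used in the paper)."""

    def __init__(self, dim: int):
        super().__init__()
        self.log_decay = nn.Parameter(torch.zeros(dim))  # per-channel decay
        self.in_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, chunk_len, dim) -> final state: (batch, dim)
        decay = torch.sigmoid(self.log_decay)            # values in (0, 1)
        state = torch.zeros(x.size(0), x.size(2), device=x.device, dtype=x.dtype)
        u = self.in_proj(x)
        for t in range(x.size(1)):                       # sequential scan over the chunk
            state = decay * state + (1.0 - decay) * u[:, t]
        return state


class AttambaAttention(nn.Module):
    """Single-head attention where the K/V projections are replaced by SSMs
    that compress each chunk of tokens into one key and one value vector."""

    def __init__(self, dim: int, chunk_size: int = 4):
        super().__init__()
        self.chunk_size = chunk_size
        self.q_proj = nn.Linear(dim, dim)
        self.k_ssm = SimpleSSM(dim)                      # replaces the key projection
        self.v_ssm = SimpleSSM(dim)                      # replaces the value projection
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, seq_len, dim = x.shape
        assert seq_len % self.chunk_size == 0, "pad the sequence to a chunk multiple"
        n_chunks = seq_len // self.chunk_size

        q = self.q_proj(x)                               # one query per token, as usual

        # Compress each chunk into a single key and value, shrinking the
        # KV-Cache and attention FLOPs by roughly a factor of chunk_size.
        chunks = x.reshape(b * n_chunks, self.chunk_size, dim)
        k = self.k_ssm(chunks).reshape(b, n_chunks, dim)
        v = self.v_ssm(chunks).reshape(b, n_chunks, dim)

        # Causal mask at chunk granularity: token i may attend to chunk j only
        # if chunk j ends at or before position i.
        token_pos = torch.arange(seq_len, device=x.device).unsqueeze(1)
        chunk_end = (torch.arange(n_chunks, device=x.device) + 1) * self.chunk_size - 1
        mask = chunk_end.unsqueeze(0) <= token_pos       # (seq_len, n_chunks)

        scores = q @ k.transpose(-2, -1) / dim ** 0.5    # (b, seq_len, n_chunks)
        scores = scores.masked_fill(~mask, float("-inf"))
        attn = torch.nan_to_num(F.softmax(scores, dim=-1))  # early tokens see no chunk
        return self.out_proj(attn @ v)


if __name__ == "__main__":
    layer = AttambaAttention(dim=64, chunk_size=4)
    out = layer(torch.randn(2, 16, 64))
    print(out.shape)  # torch.Size([2, 16, 64])
```

With `chunk_size=1` this reduces to per-token keys and values (quadratic attention), while larger chunks move toward the linear end of the trade-off the abstract describes; the actual model's handling of partially filled chunks and leading tokens is not shown here.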
Community
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- KV-Distill: Nearly Lossless Learnable Context Compression for LLMs (2025)
- ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference (2025)
- SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs (2025)
- LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference (2025)
- ZeroMerge: Parameter-Free KV Cache Compression for Memory-Efficient Long-Context LLMs (2025)
- FlexTok: Resampling Images into 1D Token Sequences of Flexible Length (2025)
- Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs (2025)