FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
Abstract
While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches focused primarily on reducing memory demands but offered limited latency improvements. To address this issue, we introduce FastKV, a KV cache compression method designed to reduce latency for long-context sequences. To improve processing speed while maintaining accuracy, FastKV adopts a novel Token-Selective Propagation (TSP) approach that retains the full context information in the initial layers of LLMs and selectively propagates only a portion of this information to deeper layers, even in the prefill stage. Additionally, FastKV incorporates grouped-query attention (GQA)-aware KV cache compression to exploit the advantages of GQA in both memory and computational efficiency. Our experimental results show that FastKV achieves 2.00× and 1.40× improvements in time-to-first-token (TTFT) and throughput, respectively, compared to HeadKV, the state-of-the-art KV cache compression method. Moreover, FastKV successfully maintains accuracy on long-context benchmarks at levels comparable to the baselines. Our code is available at https://github.com/dongwonjo/FastKV.
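To make the TSP idea concrete, the sketch below scores tokens by the attention they receive at a chosen prefill layer and forwards only the top-scoring tokens to the deeper layers. It is a minimal illustration, not the authors' implementation: the toy layers, the TSP layer index, the attention window, and the keep ratio are all assumptions, and the paper's GQA-aware KV cache compression step is omitted.

```python
# Illustrative sketch of Token-Selective Propagation (TSP) during prefill.
# Toy self-attention layers stand in for a real decoder; `tsp_layer_idx`,
# `keep_ratio`, and `window` are hypothetical parameters, and real KV-cache
# handling (including GQA-aware compression) is omitted.
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # weights: (batch, heads, seq, seq) because average_attn_weights=False
        out, weights = self.attn(x, x, x, need_weights=True,
                                 average_attn_weights=False)
        return out, weights

def tsp_prefill(x, layers, tsp_layer_idx, keep_ratio=0.25, window=8):
    for i, layer in enumerate(layers):
        x, attn = layer(x)
        if i == tsp_layer_idx:
            # Score each token by the attention it receives from the last
            # `window` query positions, averaged over heads, then keep top-k.
            scores = attn[:, :, -window:, :].mean(dim=(1, 2))   # (batch, seq)
            k = max(1, int(keep_ratio * scores.shape[-1]))
            keep = scores.topk(k, dim=-1).indices.sort(dim=-1).values
            idx = keep.unsqueeze(-1).expand(-1, -1, x.shape[-1])
            # Deeper layers see (and would cache) only the selected tokens.
            x = torch.gather(x, 1, idx)
    return x

# Example: 6 toy layers, token selection after layer 2.
layers = nn.ModuleList([ToyLayer(64, 4) for _ in range(6)])
out = tsp_prefill(torch.randn(1, 128, 64), layers, tsp_layer_idx=2)
print(out.shape)  # (1, 32, 64): only 25% of tokens reach the deeper layers
```

Because the layers after the TSP point process only the retained tokens, the prefill cost of those layers shrinks with the keep ratio, which is what drives the TTFT improvement reported in the abstract.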
Community
Introducing FastKV, a novel KV cache compression method designed to enhance inference efficiency for long-context LLMs while maintaining high accuracy.
Paper: https://arxiv.org/abs/2502.01068
Github: https://github.com/dongwonjo/FastKV
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression (2024)
- HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing (2024)
- Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models (2025)
- DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs (2024)
- Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression (2024)
- SCBench: A KV Cache-Centric Analysis of Long-Context Methods (2024)
- SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation (2024)