KVPress: KV Cache Compression Leaderboard

NVIDIA/KVPress is a comprehensive library for compressing the KV cache of transformer models, featuring multiple state-of-the-art compression methods benchmarked using 🤗 Transformers.
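
To make the workflow concrete, here is a minimal sketch following the usage pattern from the kvpress repository: a press object is passed to a text-generation pipeline so the context is compressed once during prefilling. The pipeline name, model checkpoint, and compression ratio are illustrative and should be checked against the kvpress documentation.

```python
from transformers import pipeline

from kvpress import ExpectedAttentionPress

# "kv-press-text-generation" is the custom pipeline registered when kvpress is imported;
# the model and compression ratio below are illustrative choices.
pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device="cuda",
    torch_dtype="auto",
)

context = "..."   # the long document, compressed once during prefilling
question = "..."  # answered against the compressed KV cache

press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]
```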

💡 Why KV Cache Compression

  • Deploying long-context LLMs is costly due to the linear growth of the key-value (KV) cache in transformer models. For example, handling 1M tokens with Llama 3.1-70B in float16 requires up to 330GB of memory (a back-of-the-envelope estimate follows after this list).
  • NVIDIA/KVPress implements multiple KV cache compression methods and benchmarks using Hugging Face transformers, aiming to simplify the development of new methods for researchers and developers in this field.
  • Full Transparency: We care about reproducibility and transparency. Each method in our leaderboard includes direct links to the source code and original research papers, along with the exact press initialization commands used for each experiment.
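
The 330GB figure in the first bullet can be reproduced with a quick estimate. The architecture numbers below (80 layers, 8 KV heads under grouped-query attention, head dimension 128) are the published Llama 3.1-70B configuration; float16 stores 2 bytes per value.

```python
# Back-of-the-envelope KV cache size for Llama 3.1-70B at 1M tokens in float16.
num_layers = 80
num_kv_heads = 8      # grouped-query attention
head_dim = 128
bytes_per_value = 2   # float16
tokens = 1_000_000

# Factor of 2 accounts for storing both keys and values.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
total_gb = bytes_per_token * tokens / 1e9
print(f"{bytes_per_token / 1024:.0f} KiB per token, ~{total_gb:.0f} GB for 1M tokens")
# -> 320 KiB per token, ~328 GB for 1M tokens, in line with the figure above
```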
The leaderboard can be filtered by model, by method, and by compression ratio (min and max), and the columns shown can be selected.
Each leaderboard row reports: dataset, data_dir, model, method, compression_ratio, and score. For example, AdaKVExpectedAttentionPress (source, notebook) scores 95.39 at a compression ratio of 0.25 on the 4096 data_dir subset.
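
As an illustration of what the "exact press initialization commands" mentioned above look like, the example row would plausibly correspond to an initialization along the following lines. AdaKVPress and ExpectedAttentionPress are class names from the kvpress repository, but their exact composition and signature are assumptions here; the source link in the leaderboard row is authoritative.

```python
from kvpress import AdaKVPress, ExpectedAttentionPress

# Assumed composition: AdaKV head-wise budget allocation wrapping an
# ExpectedAttention scorer, at the 0.25 compression ratio of the row above.
press = AdaKVPress(ExpectedAttentionPress(compression_ratio=0.25))
```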