KVPress: KV Cache Compression Leaderboard
NVIDIA/KVPress is a comprehensive library for compressing the KV cache of transformer models, featuring multiple state-of-the-art compression methods benchmarked using 🤗 transformers.
💡 Why KV Cache Compression
- Deploying long-context LLMs is costly because the key-value (KV) cache of transformer models grows linearly with context length. For example, handling 1M tokens with Llama 3.1-70B in float16 requires up to 330 GB of memory for the KV cache alone.
- NVIDIA/KVPress implements multiple KV cache compression methods and benchmarks using Hugging Face transformers, aiming to simplify the development of new methods for researchers and developers in this field.
- Full Transparency: We care about reproducibility and transparency. Each method in our leaderboard includes direct links to the source code and original research papers, along with the exact press initialization commands used for each experiment.
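The 330 GB figure above can be sanity-checked from the model's published architecture. A back-of-the-envelope sketch (assuming Llama 3.1-70B's config: 80 layers, 8 KV heads under grouped-query attention, head dimension 128, 2 bytes per float16 element):

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: 2 tensors (keys and values) per layer, stored for every token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

# Llama 3.1-70B: 80 layers, 8 KV heads, head_dim 128, float16 (2 bytes)
size = kv_cache_bytes(num_tokens=1_000_000, num_layers=80, num_kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # 327.7 GB, consistent with the ~330 GB figure above
```

Note how grouped-query attention already helps here: with 64 query heads but only 8 KV heads, the cache is 8x smaller than it would be under full multi-head attention.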
| dataset | data_dir | model | method | compression_ratio | score |
|---|---|---|---|---|---|
| ruler | 4096 | | | 0.25 | 95.39 |
🚀 How to Submit Your Results
We welcome contributions to the library and to the leaderboard! Submit your results by following these simple steps:
- 🔧 Implement your method in KVPress.
- ▶️ Run evaluation using our provided script.
- 🤗 Submit results via Pull Request to this repository.
Detailed Steps
Step 1: Prepare Your Method
Implement your compression technique using the KVPress framework. Implementing a new press is straightforward; you can check an example here.
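As a library-agnostic illustration of what a scoring-based press does (this is not the actual KVPress interface; see the linked example for that), compression boils down to scoring cached key/value pairs and keeping only the top fraction. A minimal NumPy sketch, using key L2 norm as a stand-in importance score:

```python
import numpy as np

def compress_kv(keys, values, compression_ratio):
    """Keep the (1 - compression_ratio) fraction of tokens with the highest
    key L2 norm. The norm is a placeholder for a real importance score.

    keys, values: arrays of shape (num_tokens, head_dim)
    """
    num_keep = int(keys.shape[0] * (1 - compression_ratio))
    scores = np.linalg.norm(keys, axis=-1)          # one score per cached token
    keep = np.sort(np.argsort(scores)[-num_keep:])  # top-k, in original order
    return keys[keep], values[keep]

keys = np.random.randn(4096, 128)
values = np.random.randn(4096, 128)
k, v = compress_kv(keys, values, compression_ratio=0.25)
print(k.shape)  # (3072, 128): 25% of the cache pruned
```

A real press would apply this per layer and per KV head during or after prefilling; the design question each method answers differently is how the score is computed.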
Step 2: Run Evaluation
Execute the evaluation script on the Ruler dataset with Llama3.1-8B. Evaluation in KVPress runs in one line:
python evaluation.py --method <your_method> --dataset ruler --model meta-llama/Meta-Llama-3.1-8B-Instruct
For a complete guide on evaluation, check the evaluation guide.
Step 3: Collect Results
The script generates a directory with the following structure:
<your_experiment_directory>/
├── predictions.csv
├── metrics.json
└── config.yaml
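Before opening a PR, it can help to verify the experiment directory is complete. A small sanity-check sketch (the expected file names come from the structure above; the path argument is whatever directory the evaluation script produced):

```python
import json
from pathlib import Path

EXPECTED = {"predictions.csv", "metrics.json", "config.yaml"}

def check_experiment_dir(path):
    """Return True if the experiment directory contains all expected files
    and the metrics file parses as JSON."""
    path = Path(path)
    missing = EXPECTED - {p.name for p in path.iterdir()}
    if missing:
        print(f"Missing files: {sorted(missing)}")
        return False
    # Peek at the metrics to confirm they parse before submitting.
    metrics = json.loads((path / "metrics.json").read_text())
    print(f"Metrics keys: {sorted(metrics)}")
    return True
```

Running `check_experiment_dir("<your_experiment_directory>")` before forking catches the most common PR problem: a missing or truncated results file.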
Step 4: Submit to Leaderboard
Fork this repository, add your experiment directory to the benchmark/ directory of this repository, and create a PR with the title: Add <method_name> results.
📋 Requirements
- Compatible with Llama3.1-8B model
- Evaluated on Ruler 4096 dataset
- Follows KVPress implementation standards
Questions? Contact us or open an issue!
📝 Citation
If you use KVPress in your research, please cite:
@misc{kvpress2024,
author = {Simon Jegou and Maximilian Jeblick and Alessio Devoto and Jiwei Liu and David Austin},
title = {KVPress: Efficient KV Cache Compression for Long-Context LLMs},
year = {2024},
url = {https://github.com/NVIDIA/kvpress},
note = {Version 1.2.0}
}
Links: GitHub