NVlabs/RocketKV
[ICML 2025] RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
This project helps machine learning engineers and researchers speed up Large Language Model (LLM) inference on very long text inputs with minimal accuracy loss. It compresses an existing LLM's KV cache (the per-token key/value memory that attention reads at every decoding step, and a major memory and bandwidth cost for long contexts) in two stages, reducing memory usage and accelerating generation. The result is a significantly faster, more resource-efficient LLM for tasks that require extensive context.
No commits in the last 6 months.
Use this if you are a machine learning engineer or researcher frequently running LLMs with long text inputs and are hitting memory or speed bottlenecks during the decoding phase.
Not ideal if you are a data scientist or application developer who primarily uses LLMs via APIs and is not directly managing their deployment or underlying infrastructure.
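To make the core idea concrete, here is a generic sketch of KV-cache token pruning in NumPy. This is an illustration only, not RocketKV's actual two-stage algorithm: it keeps the k cached positions that receive the most attention mass from the current query and discards the rest, shrinking the cache before the next decoding step.

```python
import numpy as np

# Generic KV-cache pruning sketch (NOT RocketKV's two-stage method):
# keep only the k cached positions with the highest attention logits
# for the current query vector.
def prune_kv_cache(keys, values, query, k):
    """keys/values: (seq, dim); query: (dim,). Returns (k, dim) compressed caches."""
    scores = keys @ query / np.sqrt(keys.shape[1])  # attention logit per cached position
    keep = np.argsort(scores)[-k:]                  # indices of the top-k positions
    keep.sort()                                     # preserve original token order
    return keys[keep], values[keep]

seq, dim, k = 128, 64, 16
rng = np.random.default_rng(0)
K = rng.normal(size=(seq, dim))
V = rng.normal(size=(seq, dim))
q = rng.normal(size=dim)
K_small, V_small = prune_kv_cache(K, V, q, k)
print(K_small.shape)  # (16, 64)
```

In a real system the selection would run per attention head and be combined with a second compression stage, which is where RocketKV's two-stage design comes in; see the paper for the actual method.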
Stars
34
Forks
5
Language
Python
License
—
Category
—
Last pushed
Aug 07, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NVlabs/RocketKV"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
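For programmatic use, the curl call above can be wrapped in a few lines of Python. This is a minimal sketch using only the standard library; the endpoint path comes from the listing, but the JSON response shape is not documented here, so the record is printed raw rather than parsed into specific fields.

```python
import json
import urllib.request
from urllib.error import URLError

# Base path taken from the curl example above (no API key: 100 requests/day).
BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem: str, repo: str) -> str:
    """Build the per-repo quality endpoint URL."""
    return f"{BASE}/{ecosystem}/{repo}"

def fetch_quality(ecosystem: str, repo: str, timeout: float = 10.0):
    """Return the decoded JSON record, or None if the request fails."""
    try:
        with urllib.request.urlopen(quality_url(ecosystem, repo), timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (URLError, TimeoutError, json.JSONDecodeError):
        return None  # network error or non-JSON response

if __name__ == "__main__":
    record = fetch_quality("transformers", "NVlabs/RocketKV")
    print(json.dumps(record, indent=2) if record else "request failed")
```

The `None` fallback keeps the script usable offline; for the 1,000/day tier, the free key would presumably be passed as a header or query parameter per the API's own documentation.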
Higher-rated alternatives
ModelCloud/GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
intel/auto-round
🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao
PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader
Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress
LLM KV cache compression made easy