NVlabs/RocketKV
[ICML 2025] RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
This project helps machine learning engineers and researchers speed up Large Language Model (LLM) inference on very long text inputs with minimal accuracy loss. It compresses an existing LLM's KV cache (the per-token key/value memory that attention reads at every decoding step, and a major memory and bandwidth cost for long contexts) in two stages, reducing memory usage and accelerating generation. The result is a significantly faster, more resource-efficient LLM for tasks that require extensive context.
No commits in the last 6 months.
Use this if you are a machine learning engineer or researcher frequently running LLMs with long text inputs and are hitting memory or speed bottlenecks during the decoding phase.
Not ideal if you are a data scientist or application developer who primarily uses LLMs via APIs and is not directly managing their deployment or underlying infrastructure.
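To make the core idea concrete, here is a generic sketch of KV-cache token pruning in NumPy. This is an illustration only, not RocketKV's actual two-stage algorithm: it keeps the k cached positions that receive the most attention mass from the current query and discards the rest, shrinking the cache before the next decoding step.

```python
import numpy as np

# Generic KV-cache pruning sketch (NOT RocketKV's two-stage method):
# keep only the k cached positions with the highest attention logits
# for the current query vector.
def prune_kv_cache(keys, values, query, k):
    """keys/values: (seq, dim); query: (dim,). Returns (k, dim) compressed caches."""
    scores = keys @ query / np.sqrt(keys.shape[1])  # attention logit per cached position
    keep = np.argsort(scores)[-k:]                  # indices of the top-k positions
    keep.sort()                                     # preserve original token order
    return keys[keep], values[keep]

seq, dim, k = 128, 64, 16
rng = np.random.default_rng(0)
K = rng.normal(size=(seq, dim))
V = rng.normal(size=(seq, dim))
q = rng.normal(size=dim)
K_small, V_small = prune_kv_cache(K, V, q, k)
print(K_small.shape)  # (16, 64)
```

In a real system the selection would run per attention head and be combined with a second compression stage, which is where RocketKV's two-stage design comes in; see the paper for the actual method.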
Stars
34
Forks
5
Language
Python
License
—
Category
—
Last pushed
Aug 07, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NVlabs/RocketKV"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
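For programmatic use, the curl call above can be wrapped in a few lines of Python. This is a minimal sketch using only the standard library; the endpoint path comes from the listing, but the JSON response shape is not documented here, so the record is printed raw rather than parsed into specific fields.

```python
import json
import urllib.request
from urllib.error import URLError

# Base path taken from the curl example above (no API key: 100 requests/day).
BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem: str, repo: str) -> str:
    """Build the per-repo quality endpoint URL."""
    return f"{BASE}/{ecosystem}/{repo}"

def fetch_quality(ecosystem: str, repo: str, timeout: float = 10.0):
    """Return the decoded JSON record, or None if the request fails."""
    try:
        with urllib.request.urlopen(quality_url(ecosystem, repo), timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (URLError, TimeoutError, json.JSONDecodeError):
        return None  # network error or non-JSON response

if __name__ == "__main__":
    record = fetch_quality("transformers", "NVlabs/RocketKV")
    print(json.dumps(record, indent=2) if record else "request failed")
```

The `None` fallback keeps the script usable offline; for the 1,000/day tier, the free key would presumably be passed as a header or query parameter per the API's own documentation.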
Higher-rated alternatives
ModelCloud/GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
intel/auto-round
🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao
PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader
Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress
LLM KV cache compression made easy