NVlabs/RocketKV

[ICML 2025] RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

Quality score: 36 / 100 (Emerging)

This project helps machine learning engineers and researchers accelerate inference for Large Language Models (LLMs) on very long inputs without sacrificing accuracy. It works with an existing LLM and compresses the KV cache, the per-token key/value memory the model reuses during decoding, to cut memory usage and speed up generation. The result is a noticeably faster and more resource-efficient model for long-context tasks.
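For intuition, the sketch below shows the general idea behind KV cache compression: score cached positions by how much attention recent queries pay them and keep only a fixed budget. This is an illustrative heuristic only, not RocketKV's two-stage method; the function name, tensor shapes, and scoring rule are assumptions made for the example.

# Illustrative sketch of KV cache compression: keep only the cached tokens
# that matter most for future attention. NOT the RocketKV algorithm; the
# names, shapes, and scoring rule below are assumptions for illustration.
import torch

def compress_kv_cache(keys, values, recent_queries, budget):
    """Keep the `budget` cached positions with the highest attention mass.

    keys, values:    (seq_len, head_dim) cached key/value vectors
    recent_queries:  (num_q, head_dim) queries used to estimate importance
    budget:          number of positions to retain
    """
    head_dim = keys.shape[-1]
    # Approximate each position's importance by its attention weight under a
    # handful of recent queries (a common heuristic, not RocketKV's scheme).
    scores = torch.softmax(recent_queries @ keys.T / head_dim**0.5, dim=-1)
    importance = scores.sum(dim=0)                      # (seq_len,)
    keep = importance.topk(min(budget, keys.shape[0])).indices.sort().values
    return keys[keep], values[keep]

# Example: shrink a 4096-token cache to 512 retained positions.
k, v = torch.randn(4096, 128), torch.randn(4096, 128)
q = torch.randn(8, 128)
k_small, v_small = compress_kv_cache(k, v, q, budget=512)
print(k_small.shape, v_small.shape)  # torch.Size([512, 128]) torch.Size([512, 128])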

No commits in the last 6 months.

Use this if you are a machine learning engineer or researcher frequently running LLMs with long text inputs and are hitting memory or speed bottlenecks during the decoding phase.

Not ideal if you are a data scientist or application developer who primarily uses LLMs via APIs and is not directly managing their deployment or underlying infrastructure.

Tags: LLM deployment, NLP inference optimization, Large Language Model engineering, AI model efficiency, Machine learning infrastructure
Stale (6 months) · No package · No dependents
Maintenance 2 / 25
Adoption 7 / 25
Maturity 15 / 25
Community 12 / 25


Stars: 34
Forks: 5
Language: Python
License:
Last pushed: Aug 07, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NVlabs/RocketKV"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
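The same data can be fetched from Python instead of curl; a minimal sketch, assuming the endpoint returns JSON (the response fields are not documented on this page, so the payload is simply pretty-printed):

# Fetch the quality data for NVlabs/RocketKV from the public endpoint.
# Assumes a JSON response; field names are not shown here, so we just
# pretty-print whatever comes back.
import json
import urllib.request

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/NVlabs/RocketKV"
with urllib.request.urlopen(url, timeout=10) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))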