NVIDIA/kvpress

LLM KV cache compression made easy

Quality score: 63 / 100 (Established)

Deploying large language models (LLMs) on very long inputs is expensive because the key-value (KV) cache that stores contextual information grows with input length. This project provides various methods to compress that cache, allowing LLMs to process longer texts more efficiently and cost-effectively. It is aimed at machine learning engineers and researchers working with LLMs.
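For a flavor of the workflow, here is a minimal usage sketch modeled on the project's pipeline-based interface. The model name is illustrative, and names such as ExpectedAttentionPress and the kv-press-text-generation pipeline task are assumptions to verify against the repository README.

# Minimal sketch: compressing the KV cache during generation with kvpress.
# The press class, pipeline task name, and model are illustrative; check
# the kvpress README for the presses actually shipped.
from transformers import pipeline

from kvpress import ExpectedAttentionPress

pipe = pipeline(
    "kv-press-text-generation",  # custom pipeline registered by kvpress
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative choice
    device="cuda:0",
)

context = "A very long document that would normally blow up KV cache memory..."
question = "What is the document about?"

# compression_ratio=0.5 asks the press to drop roughly half of the
# cached key-value pairs during prefill.
press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]
print(answer)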

954 stars. Actively maintained with 6 commits in the last 30 days.

Use this if you are developing or deploying large language models and need to reduce memory usage when processing long input texts, or if you are exploring new LLM compression techniques.

Not ideal if you are a general LLM user or an application developer who does not need to touch model deployment internals or memory-optimization research.

Tags: LLM deployment, large language models, memory optimization, NLP research, AI infrastructure
No Package · No Dependents
Maintenance: 17 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 20 / 25

These four subscores sum to the overall score: 17 + 10 + 16 + 20 = 63 / 100.


Stars: 954
Forks: 121
Language: Python
License: Apache-2.0
Last pushed: Mar 12, 2026
Commits (30d): 6

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NVIDIA/kvpress"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
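For programmatic use, here is a minimal Python sketch of the same request. The response schema is not documented here, so the example simply pretty-prints whatever JSON the endpoint returns rather than assuming field names.

# Minimal sketch: fetching the same quality data in Python instead of curl.
# Uses only the standard library; inspect the printed payload to learn the
# actual response fields before depending on them.
import json
import urllib.request

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/NVIDIA/kvpress"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))  # inspect the full payload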