snu-mllab/KVzip

[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)

Quality score: 44 / 100 (Emerging)

This tool helps large language model (LLM) developers cut the memory used by the KV cache during inference. It evicts cache entries in a query-agnostic way, so a single compressed cache can serve many different user queries, reducing memory footprint and response latency while supporting longer contexts. Given an LLM, a context, and a series of user queries, it produces responses faster and with a smaller memory footprint.
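The workload KVzip targets can be illustrated with plain Hugging Face transformers prefix caching: prefill a shared context once, then answer several different queries from the same cache. The sketch below is not KVzip's own API; the model name, prompts, and generation settings are illustrative assumptions, and no eviction is applied here (KVzip would additionally compress the prefilled cache before serving queries).

# Sketch: prefill a shared context once, reuse the KV cache across queries.
# Plain transformers prefix caching, NOT KVzip's API; names are assumptions.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "Qwen/Qwen2.5-0.5B-Instruct"   # assumed small model for the sketch
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

context = "KVzip evicts KV cache entries in a query-agnostic way ..."  # shared long context
ctx_inputs = tok(context, return_tensors="pt").to(model.device)

# Prefill once; this cache is what KVzip would compress before serving queries.
with torch.no_grad():
    cache = model(**ctx_inputs, past_key_values=DynamicCache(), use_cache=True).past_key_values

for query in ["Summarize the context.", "What does 'query-agnostic' mean here?"]:
    full = tok(context + "\n" + query, return_tensors="pt").to(model.device)
    out = model.generate(
        **full,
        past_key_values=copy.deepcopy(cache),  # reuse a copy of the prefilled cache
        max_new_tokens=50,
    )
    print(tok.decode(out[0][full.input_ids.shape[1]:], skip_special_tokens=True))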


Use this if you are a developer deploying large language models and need to reduce memory footprint and improve inference speed without significant performance degradation.

Not ideal if you are a general LLM user or a developer working with models that do not rely on a KV cache for inference.

LLM deployment · model serving · inference optimization · GPU memory management · AI infrastructure
No package published · No dependents
Maintenance: 10 / 25
Adoption: 10 / 25
Maturity: 15 / 25
Community: 9 / 25


Stars: 209
Forks: 9
Language: Python
License: MIT
Last pushed: Feb 11, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/snu-mllab/KVzip"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
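For scripted access, the same endpoint can be called directly. A minimal sketch using Python's requests library; the response schema is not documented here, so the JSON is printed as returned.

# Fetch the quality data for snu-mllab/KVzip from the public API.
import requests

url = "https://pt-edge.onrender.com/api/v1/quality/embeddings/snu-mllab/KVzip"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
print(resp.json())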