snu-mllab/KVzip
[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)
KVzip helps large language model (LLM) developers cut the memory the KV cache consumes during inference. By compressing the cache in a query-agnostic way, it lets a single compressed context serve diverse user queries, lowering response latency and enabling longer contexts. Given an LLM, a context, and a series of user queries, it generates responses faster and with a smaller memory footprint.
Use this if you are a developer deploying large language models and need to reduce memory footprint and improve inference speed without significant performance degradation.
Not ideal if you are a general LLM user or a developer working with models that do not rely on a KV cache for inference.
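The core idea of query-agnostic KV cache eviction is to score each cached key/value pair by an importance measure computed from the context alone (not from any particular query) and keep only the highest-scoring fraction. The sketch below illustrates the shape of that operation with NumPy; the key-norm scoring heuristic and the `keep_ratio` parameter are illustrative stand-ins, not KVzip's actual scoring method.

```python
import numpy as np

def evict_kv_cache(keys, values, keep_ratio=0.25):
    """Query-agnostic eviction sketch.

    Scores each cached position by the L2 norm of its key vector
    (an illustrative heuristic, not KVzip's scoring rule) and keeps
    only the top `keep_ratio` fraction, preserving sequence order.
    """
    scores = np.linalg.norm(keys, axis=-1)           # (seq_len,)
    n_keep = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-n_keep:])     # indices to retain
    return keys[keep], values[keep]

# Toy cache: 100 cached positions with head dimension 8.
rng = np.random.default_rng(0)
K = rng.standard_normal((100, 8))
V = rng.standard_normal((100, 8))
K2, V2 = evict_kv_cache(K, V, keep_ratio=0.25)
print(K2.shape, V2.shape)  # (25, 8) (25, 8)
```

Keeping 25% of the cache in this toy setting mirrors the 3–4× memory reduction the project reports; the real system scores entries once per context so the same compressed cache can answer any subsequent query.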
Stars: 209
Forks: 9
Language: Python
License: MIT
Category:
Last pushed: Feb 11, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/snu-mllab/KVzip"
Open to everyone: 100 requests/day with no key needed; a free key raises the limit to 1,000/day.
Higher-rated alternatives
Zefan-Cai/R-KV
[NeurIPS 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
codefuse-ai/ModelCache
A LLM semantic caching system aiming to enhance user experience by reducing response time via...
philtimmes/KeSSie
KeSSie HUGE Context Semantic recall for Large Language Models