snu-mllab/KVzip
[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)
KVzip helps large language model (LLM) developers cut the memory the KV cache consumes during inference. By compressing the cache in a query-agnostic way, it lets a single compressed context serve diverse user queries, lowering response latency and enabling longer contexts. Given an LLM, a context, and a series of user queries, it generates responses faster and with a smaller memory footprint.
Use this if you are a developer deploying large language models and need to reduce memory footprint and improve inference speed without significant performance degradation.
Not ideal if you are a general LLM user or a developer working with models that do not rely on a KV cache for inference.
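The core idea of query-agnostic KV cache eviction is to score each cached key/value pair by an importance measure computed from the context alone (not from any particular query) and keep only the highest-scoring fraction. The sketch below illustrates the shape of that operation with NumPy; the key-norm scoring heuristic and the `keep_ratio` parameter are illustrative stand-ins, not KVzip's actual scoring method.

```python
import numpy as np

def evict_kv_cache(keys, values, keep_ratio=0.25):
    """Query-agnostic eviction sketch.

    Scores each cached position by the L2 norm of its key vector
    (an illustrative heuristic, not KVzip's scoring rule) and keeps
    only the top `keep_ratio` fraction, preserving sequence order.
    """
    scores = np.linalg.norm(keys, axis=-1)           # (seq_len,)
    n_keep = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-n_keep:])     # indices to retain
    return keys[keep], values[keep]

# Toy cache: 100 cached positions with head dimension 8.
rng = np.random.default_rng(0)
K = rng.standard_normal((100, 8))
V = rng.standard_normal((100, 8))
K2, V2 = evict_kv_cache(K, V, keep_ratio=0.25)
print(K2.shape, V2.shape)  # (25, 8) (25, 8)
```

Keeping 25% of the cache in this toy setting mirrors the 3–4× memory reduction the project reports; the real system scores entries once per context so the same compressed cache can answer any subsequent query.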
Stars: 209
Forks: 9
Language: Python
License: MIT
Category:
Last pushed: Feb 11, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/snu-mllab/KVzip"
Open to everyone: 100 requests/day with no key needed; a free key raises the limit to 1,000/day.
Higher-rated alternatives
Zefan-Cai/R-KV
[NeurIPS 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
codefuse-ai/ModelCache
A LLM semantic caching system aiming to enhance user experience by reducing response time via...
philtimmes/KeSSie
KeSSie HUGE Context Semantic recall for Large Language Models