Zefan-Cai/R-KV
[NeurIPS 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
When a large language model works through a complex reasoning task, such as an advanced math problem, it can generate a very long chain of thought whose KV cache consumes a large amount of GPU memory. R-KV compresses this cache on the fly, scoring cached tokens and evicting those that are redundant or unimportant. The model reaches the same or better accuracy while using significantly less memory and running faster. It is designed for anyone deploying or managing large language models for reasoning-heavy applications.
Use this if you are running large language models for complex, multi-step reasoning tasks and are facing challenges with high memory usage or slow inference speeds due to long generated outputs.
Not ideal if your models generate short responses or if your primary bottleneck is prompt processing rather than output generation during reasoning.
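To make "discarding repetitive or less important information" concrete, here is a minimal sketch of redundancy-aware KV eviction. This is not R-KV's actual algorithm: the greedy selection, the attention-based importance scores, and the `lam` trade-off weight are all illustrative assumptions; the real method is described in the paper.

```python
import numpy as np

def rkv_style_compress(keys, values, attn_scores, budget, lam=0.5):
    """Illustrative sketch (not the paper's exact method): greedily keep
    `budget` cached tokens, trading off attention importance against
    cosine similarity to tokens already kept (redundancy)."""
    n = keys.shape[0]
    norms = np.linalg.norm(keys, axis=1, keepdims=True)
    unit = keys / np.maximum(norms, 1e-8)   # unit-normalized keys
    kept = []
    max_sim = np.zeros(n)                   # each token's max similarity to the kept set
    for _ in range(min(budget, n)):
        # score = importance minus a redundancy penalty
        score = attn_scores - lam * max_sim
        score[kept] = -np.inf               # never re-pick a kept token
        i = int(np.argmax(score))
        kept.append(i)
        max_sim = np.maximum(max_sim, unit @ unit[i])
    kept.sort()
    return keys[kept], values[kept], kept
```

With a budget of 2, a token whose key duplicates an already-kept one is skipped in favor of a more diverse token, even if its raw attention score is higher.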
Stars
1,183
Forks
190
Language
Python
License
—
Category
Last pushed
Oct 16, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/Zefan-Cai/R-KV"
Open to everyone: 100 requests/day with no key. A free key raises the limit to 1,000/day.
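For programmatic access, the endpoint can be built from the owner/repo slug. This is a small sketch generalized from the single example above; the assumption that other repositories follow the same `/api/v1/quality/embeddings/{owner}/{repo}` path pattern is mine, not documented here.

```python
from urllib.parse import quote

# Assumed pattern, generalized from the single curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/embeddings"

def quality_url(owner: str, repo: str) -> str:
    """Build the (assumed) per-repository endpoint URL."""
    return f"{BASE}/{quote(owner)}/{quote(repo)}"
```

The resulting URL can then be fetched with `urllib.request.urlopen(quality_url("Zefan-Cai", "R-KV"))`, or with curl as shown above.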
Related tools
snu-mllab/KVzip
[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in...
codefuse-ai/ModelCache
A LLM semantic caching system aiming to enhance user experience by reducing response time via...
philtimmes/KeSSie
KeSSie HUGE Context Semantic recall for Large Language Models