Zefan-Cai/KVCache-Factory
Unified KV Cache Compression Methods for Auto-Regressive Models
This project helps large language model (LLM) operators and researchers reduce memory footprint and speed up inference on long text sequences. It applies a range of KV cache compression methods to existing LLMs, cutting memory use during generation, especially for complex or lengthy prompts. The primary users are MLOps engineers, data scientists, and AI researchers deploying and experimenting with LLMs.
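To see why KV cache compression matters, the cache's memory footprint can be estimated from the model's shape: keys and values are stored for every layer, head, and token position. A minimal back-of-the-envelope sketch (the 7B-class dimensions below are illustrative assumptions, not taken from this repository):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size: the factor 2 covers keys plus values,
    each stored per layer, per head, per position, per batch element."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)

# Illustrative numbers for a 7B-class model (32 layers, 32 KV heads,
# head_dim 128, fp16 = 2 bytes) -- assumptions for this sketch:
full = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(f"uncompressed: {full / 2**30:.1f} GiB")        # 2.0 GiB at 4K context

# A compression method that keeps roughly a quarter of the cache
# shrinks the footprint proportionally:
print(f"4x compressed: {full * 0.25 / 2**30:.1f} GiB")  # 0.5 GiB
```

At these dimensions the cache grows by 0.5 MiB per generated token, which is why long prompts and dialogues dominate GPU memory even for modest batch sizes.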
1,309 stars. No commits in the last 6 months.
Use this if you need to make your large language models run more efficiently on limited hardware, particularly when dealing with long user inputs or complex dialogues.
Not ideal if you are a casual user of LLMs and are not directly involved in their deployment or performance optimization.
Stars: 1,309
Forks: 163
Language: Python
License: MIT
Category: (none listed)
Last pushed: Jan 04, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Zefan-Cai/KVCache-Factory"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
Higher-rated alternatives
LMCache/LMCache: Supercharge Your LLM with the Fastest KV Cache Layer
dataflowr/llm_efficiency: KV Cache & LoRA for minGPT
OnlyTerp/kvtc: First open-source KVTC implementation (NVIDIA, ICLR 2026), 8-32x KV cache compression via PCA...
itsnamgyu/block-transformer: Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
OnlyTerp/turboquant: First open-source implementation of Google TurboQuant (ICLR 2026), near-optimal KV cache...