NVIDIA/kvpress
LLM KV cache compression made easy
Deploying large language models (LLMs) on very long inputs can be expensive because of the memory required to store the key-value (KV) cache that holds contextual information. This project provides a collection of methods to compress that cache, letting LLMs process longer texts more efficiently and at lower cost. It is aimed at machine learning engineers and researchers working with LLMs.
954 stars. Actively maintained with 6 commits in the last 30 days.
Use this if you are developing or deploying large language models and need to reduce memory usage when processing long inputs, or if you are exploring new LLM compression techniques.
Not ideal if you are a general LLM user or application developer who is not concerned with model deployment or memory-optimization research.
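To make the memory trade-off concrete: KV cache compression scores each cached position and keeps only the top-scoring ones. The snippet below is a toy illustration of that idea, using key L2 norm as the score (one heuristic from the literature). It is not kvpress's actual API; `prune_kv_cache`, its signature, and the shapes are invented for illustration.

```python
import numpy as np

def prune_kv_cache(keys, values, compression_ratio):
    """Toy KV cache pruning: keep the (1 - compression_ratio) fraction of
    cached positions whose keys score highest, using key L2 norm as an
    illustrative scoring heuristic."""
    n = keys.shape[0]
    n_keep = max(1, int(n * (1 - compression_ratio)))
    scores = np.linalg.norm(keys, axis=-1)      # one score per cached position
    keep = np.argsort(scores)[-n_keep:]         # indices of top-scoring positions
    keep.sort()                                 # preserve original token order
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
keys = rng.standard_normal((128, 64))    # 128 cached positions, head dim 64
values = rng.standard_normal((128, 64))
k2, v2 = prune_kv_cache(keys, values, compression_ratio=0.75)
print(k2.shape)  # (32, 64) — 75% of the cache dropped
```

Real methods differ mainly in how they score positions (attention statistics, learned importance, norms); the pruning step itself is as simple as above.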
Stars: 954
Forks: 121
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 12, 2026
Commits (30d): 6
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NVIDIA/kvpress"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related models
ModelCloud/GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
intel/auto-round
🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao
PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader
Simple go utility to download HuggingFace Models and Datasets
BlinkDL/RWKV-LM
RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly...