NVIDIA/kvpress
LLM KV cache compression made easy
Deploying large language models (LLMs) on very long inputs can be expensive because of the memory required to store the key-value (KV) cache that holds contextual information. This project provides a collection of methods to compress that cache, letting LLMs process longer texts more efficiently and at lower cost. It is aimed at machine learning engineers and researchers working with LLMs.
954 stars. Actively maintained with 6 commits in the last 30 days.
Use this if you are developing or deploying large language models and need to reduce memory usage when processing long inputs, or if you are exploring new LLM compression techniques.
Not ideal if you are a general LLM user or application developer who is not concerned with model deployment or memory-optimization research.
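To make the memory trade-off concrete: KV cache compression scores each cached position and keeps only the top-scoring ones. The snippet below is a toy illustration of that idea, using key L2 norm as the score (one heuristic from the literature). It is not kvpress's actual API; `prune_kv_cache`, its signature, and the shapes are invented for illustration.

```python
import numpy as np

def prune_kv_cache(keys, values, compression_ratio):
    """Toy KV cache pruning: keep the (1 - compression_ratio) fraction of
    cached positions whose keys score highest, using key L2 norm as an
    illustrative scoring heuristic."""
    n = keys.shape[0]
    n_keep = max(1, int(n * (1 - compression_ratio)))
    scores = np.linalg.norm(keys, axis=-1)      # one score per cached position
    keep = np.argsort(scores)[-n_keep:]         # indices of top-scoring positions
    keep.sort()                                 # preserve original token order
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
keys = rng.standard_normal((128, 64))    # 128 cached positions, head dim 64
values = rng.standard_normal((128, 64))
k2, v2 = prune_kv_cache(keys, values, compression_ratio=0.75)
print(k2.shape)  # (32, 64) — 75% of the cache dropped
```

Real methods differ mainly in how they score positions (attention statistics, learned importance, norms); the pruning step itself is as simple as above.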
Stars: 954
Forks: 121
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 12, 2026
Commits (30d): 6
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NVIDIA/kvpress"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related models
ModelCloud/GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
intel/auto-round
🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao
PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader
Simple go utility to download HuggingFace Models and Datasets
BlinkDL/RWKV-LM
RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly...