SqueezeAILab/KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
This project helps AI developers and researchers overcome GPU memory limits when running large language models (LLMs) on extremely long texts or conversations. It takes a trained LLM and quantizes its key-value (KV) cache, reducing inference memory so that millions of tokens of context can be processed on fewer or less powerful GPUs. This makes it practical to deploy LLMs for applications that need extensive context, such as analyzing large documents or maintaining very long dialogues.
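To give a sense of why the KV cache becomes the memory bottleneck at long context, here is a rough back-of-the-envelope sketch. The model dimensions (32 layers, 32 KV heads, head dimension 128, roughly a LLaMA-7B-sized model), the 1M-token context, and the 3-bit width are illustrative assumptions, not figures taken from this page. The estimate also ignores quantization metadata such as scales, zero points, or outlier storage, so real savings would be somewhat smaller.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Estimate KV cache size in bytes for a single sequence.

    Each token stores one key vector and one value vector (hence the factor 2)
    per KV head in every layer.
    """
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return n_values * bits_per_value / 8


GIB = 1024 ** 3

# Assumed, illustrative model shape (roughly LLaMA-7B-sized).
model = dict(n_layers=32, n_kv_heads=32, head_dim=128)
context_len = 1_000_000  # assumed context length in tokens

fp16_bytes = kv_cache_bytes(context_len, bits_per_value=16, **model)
q3_bytes = kv_cache_bytes(context_len, bits_per_value=3, **model)

print(f"fp16 KV cache:  {fp16_bytes / GIB:.1f} GiB")
print(f"3-bit KV cache: {q3_bytes / GIB:.1f} GiB (metadata overhead ignored)")
```

Under these assumptions the fp16 cache alone needs about 488 GiB, while a 3-bit cache needs about 92 GiB, which is the kind of gap that decides how many GPUs a long-context deployment requires.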
406 stars. No commits in the last 6 months.
Use this if you are a machine learning engineer or researcher struggling to deploy LLMs with very long context windows due to excessive GPU memory consumption.
Not ideal if you are looking for a pre-trained LLM or a general-purpose inference acceleration library that doesn't specifically target KV cache memory optimization.
Stars: 406
Forks: 37
Language: Python
License: —
Category:
Last pushed: Aug 13, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/SqueezeAILab/KVQuant"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
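The same request can be made from Python with only the standard library. This is a minimal sketch: the helper names `quality_url` and `fetch_quality` are hypothetical, the meaning of the `transformers` path segment is inferred from the curl command above, and the response format is not documented on this page, so the code tries JSON and falls back to raw text.

```python
import json
import urllib.request

# Base path copied from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL for a repository given as 'owner/name'."""
    return f"{BASE_URL}/{category}/{repo}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0):
    """GET the endpoint; return parsed JSON if the body is JSON, else raw text.

    The response schema is an assumption, so no fields are accessed here.
    """
    with urllib.request.urlopen(quality_url(category, repo), timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body


# Example call (performs a network request, so it is left commented out):
# print(fetch_quality("transformers", "SqueezeAILab/KVQuant"))
```

With the keyless limit of 100 requests/day, it may be worth caching the result locally rather than calling the endpoint on every run.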
Higher-rated alternatives
ModelCloud/GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
intel/auto-round
🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao
PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader
Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress
LLM KV cache compression made easy