jy-yuan/KIVI
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
This project helps large language model (LLM) developers and researchers deploy their models more efficiently. It quantizes the key-value (KV) cache of existing LLMs, such as Llama-2 or Mistral, to 2 bits, using an asymmetric scheme that quantizes the key cache per-channel and the value cache per-token. The result is an LLM that runs faster, serves larger batches, and uses significantly less memory, all without any fine-tuning.
Use this if you are a machine learning engineer or researcher looking to improve the inference speed and memory footprint of your LLMs, especially when working with models like Llama, Falcon, or Mistral.
Not ideal if you are an end-user of an LLM and do not directly manage model deployment or infrastructure.
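To illustrate the core idea, here is a minimal NumPy sketch of asymmetric low-bit quantization with a configurable axis, which is how KIVI's per-channel (keys) vs. per-token (values) distinction can be expressed. This is an illustration only, not the repository's fused CUDA implementation; the function names and the group-free layout are assumptions for clarity.

```python
import numpy as np

def quantize_asym_2bit(x, axis):
    """Asymmetric 2-bit quantization along `axis`.

    Maps each value to an integer in [0, 3] using a per-slice
    scale and zero-point (the minimum acts as the zero-point).
    """
    xmin = x.min(axis=axis, keepdims=True)
    xmax = x.max(axis=axis, keepdims=True)
    scale = (xmax - xmin) / 3.0                 # 2 bits -> 4 levels
    scale = np.where(scale == 0, 1.0, scale)    # avoid division by zero
    q = np.clip(np.round((x - xmin) / scale), 0, 3).astype(np.uint8)
    return q, scale, xmin

def dequantize(q, scale, zero_point):
    """Reconstruct approximate float values from 2-bit codes."""
    return q.astype(np.float32) * scale + zero_point

# Example: one (tokens, channels) slice of a KV cache.
rng = np.random.default_rng(0)
k = rng.normal(size=(8, 16)).astype(np.float32)

# KIVI's observation: keys quantize better per-channel (reduce over
# the token axis, axis=0); values per-token (axis=1).
qk, scale, zp = quantize_asym_2bit(k, axis=0)
k_hat = dequantize(qk, scale, zp)

# Rounding error is bounded by half a quantization step per slice.
max_err = np.abs(k - k_hat).max()
print(max_err <= scale.max() / 2 + 1e-6)
```

Note that the real implementation also keeps a small window of the most recent tokens in full precision; the sketch above omits that detail.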
Stars: 359
Forks: 44
Language: Python
License: MIT
Category:
Last pushed: Nov 20, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/jy-yuan/KIVI"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
ModelCloud/GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
intel/auto-round
🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao
PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader
Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress
LLM KV cache compression made easy