jy-yuan/KIVI
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
This project helps large language model (LLM) developers and researchers deploy their models more efficiently. It quantizes the key-value (KV) cache of existing LLMs, such as Llama-2 or Mistral, to 2 bits, using an asymmetric scheme that quantizes the key cache per-channel and the value cache per-token. The result is an LLM that runs faster, serves larger batches, and uses significantly less memory, all without any fine-tuning.
Use this if you are a machine learning engineer or researcher looking to improve the inference speed and memory footprint of your LLMs, especially when working with models like Llama, Falcon, or Mistral.
Not ideal if you are an end-user of an LLM and do not directly manage model deployment or infrastructure.
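To illustrate the core idea, here is a minimal NumPy sketch of asymmetric low-bit quantization with a configurable axis, which is how KIVI's per-channel (keys) vs. per-token (values) distinction can be expressed. This is an illustration only, not the repository's fused CUDA implementation; the function names and the group-free layout are assumptions for clarity.

```python
import numpy as np

def quantize_asym_2bit(x, axis):
    """Asymmetric 2-bit quantization along `axis`.

    Maps each value to an integer in [0, 3] using a per-slice
    scale and zero-point (the minimum acts as the zero-point).
    """
    xmin = x.min(axis=axis, keepdims=True)
    xmax = x.max(axis=axis, keepdims=True)
    scale = (xmax - xmin) / 3.0                 # 2 bits -> 4 levels
    scale = np.where(scale == 0, 1.0, scale)    # avoid division by zero
    q = np.clip(np.round((x - xmin) / scale), 0, 3).astype(np.uint8)
    return q, scale, xmin

def dequantize(q, scale, zero_point):
    """Reconstruct approximate float values from 2-bit codes."""
    return q.astype(np.float32) * scale + zero_point

# Example: one (tokens, channels) slice of a KV cache.
rng = np.random.default_rng(0)
k = rng.normal(size=(8, 16)).astype(np.float32)

# KIVI's observation: keys quantize better per-channel (reduce over
# the token axis, axis=0); values per-token (axis=1).
qk, scale, zp = quantize_asym_2bit(k, axis=0)
k_hat = dequantize(qk, scale, zp)

# Rounding error is bounded by half a quantization step per slice.
max_err = np.abs(k - k_hat).max()
print(max_err <= scale.max() / 2 + 1e-6)
```

Note that the real implementation also keeps a small window of the most recent tokens in full precision; the sketch above omits that detail.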
Stars: 359
Forks: 44
Language: Python
License: MIT
Category:
Last pushed: Nov 20, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/jy-yuan/KIVI"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
ModelCloud/GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
intel/auto-round
🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao
PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader
Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress
LLM KV cache compression made easy