OnlyTerp/turboquant

First open-source implementation of Google TurboQuant (ICLR 2026) -- near-optimal KV cache compression for LLM inference. 5x compression with near-zero quality loss.

Emerging

This project helps you run large language models (LLMs) more efficiently by significantly reducing the memory they need during inference. It takes the model's internal 'KV cache' data and compresses it by up to 7 times while maintaining almost the same quality in the model's responses. Anyone who deploys or manages LLMs and wants to serve more users, handle longer text inputs, or reduce GPU costs would find this valuable.
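To make the memory claim concrete, here is a rough back-of-the-envelope sketch of how large an uncompressed KV cache gets and what a 5x reduction would save. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the helper name `kv_cache_bytes` are assumptions chosen for illustration, not values or functions taken from this project.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Uncompressed KV cache size in bytes.

    Each token stores one key vector and one value vector (hence the
    factor of 2) per layer and per KV head.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, for illustration only.
baseline = kv_cache_bytes(num_layers=32, num_kv_heads=8,
                          head_dim=128, seq_len=32_768)

# 5x is the ratio quoted in the project description; treat it as a parameter.
compression_ratio = 5
compressed = baseline / compression_ratio

print(f"{baseline / 2**30:.2f} GiB -> {compressed / 2**30:.2f} GiB")
# prints "4.00 GiB -> 0.80 GiB"
```

Under these assumed dimensions, a single 32K-token sequence drops from about 4 GiB to under 1 GiB of cache, which is where the room for more concurrent users or longer contexts would come from.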

Use this if you are running LLMs and frequently hit GPU memory limits or want to increase the throughput and context length for your users.

Not ideal if you need a production-ready, highly optimized solution for immediate deployment, as this is a reference implementation focused on correctness rather than speed.

Tags: LLM deployment, AI inference optimization, GPU memory management, large language models, model serving

No Package No Dependents

Maintenance 13 / 25

Adoption 7 / 25

Maturity 9 / 25

Community 6 / 25

Language: Python

License: MIT

Higher-rated alternatives

LMCache/LMCache

Supercharge Your LLM with the Fastest KV Cache Layer

Zefan-Cai/KVCache-Factory

Unified KV Cache Compression Methods for Auto-Regressive Models

dataflowr/llm_efficiency

KV Cache & LoRA for minGPT

OnlyTerp/kvtc

First open-source KVTC implementation (NVIDIA, ICLR 2026) -- 8-32x KV cache compression via PCA...

itsnamgyu/block-transformer

Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
