turboquant and turboquant-torch
These two projects overlap directly: both present themselves as implementations of the same TurboQuant technique for KV cache compression in LLMs.
About turboquant
OnlyTerp/turboquant
First open-source implementation of Google's TurboQuant (ICLR 2026): near-optimal KV cache compression for LLM inference. 5x compression with near-zero quality loss.
This project helps you run large language models (LLMs) more efficiently by significantly reducing the memory they need during inference. It compresses the model's internal KV cache by up to 7x while keeping response quality almost unchanged. Anyone who deploys or manages LLMs and wants to serve more users, handle longer contexts, or cut GPU costs would find this valuable.
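To see why KV cache compression matters, a back-of-envelope sizing calculation helps. The numbers below (32 layers, 8 KV heads, head dimension 128, 32K context) describe a hypothetical 7B-class model and are not taken from either repo; the point is only that the fp16 cache grows to gigabytes and a 3-bit representation shrinks it by roughly 5x:

```python
# Back-of-envelope KV cache sizing (illustrative numbers, not from either repo).
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 7B-class model at a 32K-token context.
fp16 = kv_cache_bytes(32, 8, 128, seq_len=32_768, bytes_per_elem=2)
q3   = kv_cache_bytes(32, 8, 128, seq_len=32_768, bytes_per_elem=3 / 8)

print(f"fp16 cache:  {fp16 / 2**30:.2f} GiB")
print(f"3-bit cache: {q3 / 2**30:.2f} GiB")
print(f"ratio:       {fp16 / q3:.1f}x")
```

The ratio (16 bits / 3 bits ≈ 5.3x) ignores per-block metadata such as scales, which is why real-world figures like the "5x" in the tagline land slightly below the raw bit ratio.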
About turboquant-torch
codepawl/turboquant-torch
Unofficial PyTorch implementation of TurboQuant (Google Research, ICLR 2026). Near-optimal vector quantization for KV cache compression and vector search. 3-bit with zero accuracy loss.
This tool helps AI practitioners and researchers dramatically reduce the memory footprint of large language models (LLMs) during inference. It takes your existing PyTorch LLM and compresses its internal memory (the KV cache) or vector databases, yielding a model that uses significantly less memory with virtually no loss in accuracy. It is designed for anyone running LLMs where memory efficiency is critical.
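TurboQuant's contribution is near-optimal vector quantization; as a much simpler reference point (plain uniform 3-bit scalar quantization with per-channel scales, deliberately NOT the repo's algorithm or API), here is what a low-bit round trip over a KV-like tensor looks like mechanically:

```python
import numpy as np

# Baseline sketch: uniform 3-bit scalar quantization of a synthetic KV tensor.
# This is not TurboQuant; it only illustrates the quantize/dequantize mechanics.
rng = np.random.default_rng(0)
kv = rng.standard_normal((1024, 128)).astype(np.float32)  # (tokens, head_dim)

bits = 3
levels = 2 ** bits - 1                       # 8 levels -> integer codes 0..7
lo = kv.min(axis=0, keepdims=True)           # per-channel minimum
scale = (kv.max(axis=0, keepdims=True) - lo) / levels

codes = np.clip(np.round((kv - lo) / scale), 0, levels).astype(np.uint8)
kv_hat = codes * scale + lo                  # dequantize back to float

err = np.abs(kv - kv_hat).max()
print(f"max abs round-trip error: {err:.3f}")
```

Scalar quantization like this loses noticeably more accuracy at 3 bits than the vector-quantization approach the repos implement; the gap between the two is exactly what "near-optimal" refers to.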
Scores updated daily from GitHub, PyPI, and npm data.