codepawl/turboquant-torch
Unofficial PyTorch implementation of TurboQuant (Google Research, ICLR 2026). Near-optimal vector quantization for KV cache compression and vector search, targeting 3-bit quantization with zero accuracy loss.
This tool helps AI practitioners and researchers dramatically reduce the memory footprint of large language models (LLMs) during inference. It takes your existing PyTorch LLM and compresses its internal memory (KV cache) or vector databases, outputting a model that uses significantly less RAM with virtually no loss in accuracy. This is designed for anyone running LLMs where memory efficiency is critical.
Use this if you need to run large AI models on hardware with limited memory, want to reduce inference costs, or speed up vector search operations.
Not ideal if you are working with extremely small models where memory is not a bottleneck or if your primary concern is maximum training speed rather than inference efficiency.
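TurboQuant's actual algorithm (near-optimal codebooks and rotations, per the paper) is not reproduced here. As a minimal illustration of the kind of round trip involved in 3-bit KV cache compression, here is a generic per-row uniform quantizer sketch in NumPy; it is not this repo's API, and all names are hypothetical.

```python
# Illustrative only: a generic 3-bit uniform quantizer, NOT TurboQuant's
# method. Shows how float rows can be stored as 8-level integer codes
# plus per-row scale/offset metadata.
import numpy as np

def quantize_3bit(x: np.ndarray):
    """Quantize each row of x to 3-bit codes (values 0..7) plus lo/scale."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum((hi - lo) / 7.0, 1e-12)  # guard flat rows
    codes = np.clip(np.rint((x - lo) / scale), 0, 7).astype(np.uint8)
    return codes, lo, scale

def dequantize_3bit(codes: np.ndarray, lo: np.ndarray, scale: np.ndarray):
    """Reconstruct approximate float values from the 3-bit codes."""
    return codes.astype(np.float32) * scale + lo
```

At 3 bits per value instead of fp16's 16, storage drops roughly 5x (plus small per-row metadata); the reconstruction error of this naive scheme is at most half a quantization step per element, which is the baseline that learned codebooks improve on.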
Stars: 9
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: Mar 25, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/codepawl/turboquant-torch"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000/day.
Higher-rated alternatives
LMCache/LMCache
Supercharge Your LLM with the Fastest KV Cache Layer
Zefan-Cai/KVCache-Factory
Unified KV Cache Compression Methods for Auto-Regressive Models
dataflowr/llm_efficiency
KV Cache & LoRA for minGPT
OnlyTerp/kvtc
First open-source KVTC implementation (NVIDIA, ICLR 2026) -- 8-32x KV cache compression via PCA...
itsnamgyu/block-transformer
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)