codepawl/turboquant-torch
Unofficial PyTorch implementation of TurboQuant (Google Research, ICLR 2026). Near-optimal vector quantization for KV cache compression and vector search, targeting 3-bit quantization with zero accuracy loss.
This tool helps AI practitioners and researchers dramatically reduce the memory footprint of large language models (LLMs) during inference. It takes your existing PyTorch LLM and compresses its internal memory (KV cache) or vector databases, outputting a model that uses significantly less RAM with virtually no loss in accuracy. This is designed for anyone running LLMs where memory efficiency is critical.
Use this if you need to run large AI models on hardware with limited memory, want to reduce inference costs, or speed up vector search operations.
Not ideal if you are working with extremely small models where memory is not a bottleneck or if your primary concern is maximum training speed rather than inference efficiency.
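TurboQuant's actual algorithm (near-optimal codebooks and rotations, per the paper) is not reproduced here. As a minimal illustration of the kind of round trip involved in 3-bit KV cache compression, here is a generic per-row uniform quantizer sketch in NumPy; it is not this repo's API, and all names are hypothetical.

```python
# Illustrative only: a generic 3-bit uniform quantizer, NOT TurboQuant's
# method. Shows how float rows can be stored as 8-level integer codes
# plus per-row scale/offset metadata.
import numpy as np

def quantize_3bit(x: np.ndarray):
    """Quantize each row of x to 3-bit codes (values 0..7) plus lo/scale."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum((hi - lo) / 7.0, 1e-12)  # guard flat rows
    codes = np.clip(np.rint((x - lo) / scale), 0, 7).astype(np.uint8)
    return codes, lo, scale

def dequantize_3bit(codes: np.ndarray, lo: np.ndarray, scale: np.ndarray):
    """Reconstruct approximate float values from the 3-bit codes."""
    return codes.astype(np.float32) * scale + lo
```

At 3 bits per value instead of fp16's 16, storage drops roughly 5x (plus small per-row metadata); the reconstruction error of this naive scheme is at most half a quantization step per element, which is the baseline that learned codebooks improve on.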
Stars: 9
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: Mar 25, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/codepawl/turboquant-torch"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000/day.
Higher-rated alternatives
LMCache/LMCache
Supercharge Your LLM with the Fastest KV Cache Layer
Zefan-Cai/KVCache-Factory
Unified KV Cache Compression Methods for Auto-Regressive Models
dataflowr/llm_efficiency
KV Cache & LoRA for minGPT
OnlyTerp/kvtc
First open-source KVTC implementation (NVIDIA, ICLR 2026) -- 8-32x KV cache compression via PCA...
itsnamgyu/block-transformer
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)