codepawl/turboquant-torch

Unofficial PyTorch implementation of TurboQuant (Google Research, ICLR 2026). Near-optimal vector quantization for KV cache compression and vector search; 3-bit quantization with near-zero accuracy loss.

Score: 27 / 100 (Experimental)

This tool helps AI practitioners and researchers dramatically reduce the memory footprint of large language models (LLMs) during inference. It compresses an existing PyTorch LLM's key-value (KV) cache, or the vectors in a search index, so that inference runs in significantly less memory with virtually no loss in accuracy. It is designed for anyone running LLMs where memory efficiency is critical.

Use this if you need to run large AI models on hardware with limited memory, want to reduce inference costs, or want to speed up vector search operations.

Not ideal if you are working with extremely small models where memory is not a bottleneck or if your primary concern is maximum training speed rather than inference efficiency.
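To build intuition for what low-bit quantization of a KV cache looks like, here is a minimal sketch of a 3-bit quantize/dequantize round trip in plain PyTorch. The helper names and the per-tensor scaling scheme are hypothetical illustrations, not turboquant-torch's API or TurboQuant's actual algorithm, and real 3-bit storage would additionally require bit-packing (uint8 is used here for simplicity).

import torch

def quantize_3bit(x: torch.Tensor):
    # Map values into the 3-bit integer range [0, 7] with a per-tensor scale.
    # Hypothetical helper, not part of turboquant-torch.
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round((x - lo) / scale), 0, 7).to(torch.uint8)
    return q, scale, lo

def dequantize_3bit(q: torch.Tensor, scale: torch.Tensor, lo: torch.Tensor):
    # Reconstruct an approximation of the original tensor.
    return q.to(torch.float32) * scale + lo

kv = torch.randn(2, 8, 128, 64)  # (batch, heads, seq_len, head_dim)
q, scale, lo = quantize_3bit(kv)
kv_hat = dequantize_3bit(q, scale, lo)
print("max abs error:", (kv - kv_hat).abs().max().item())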

Tags: LLM deployment, AI inference optimization, vector search, memory management, resource-constrained AI
No package · No dependents
Maintenance: 13 / 25
Adoption: 5 / 25
Maturity: 9 / 25
Community: 0 / 25


Stars: 9
Forks:
Language: Python
License: MIT
Last pushed: Mar 25, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/codepawl/turboquant-torch"

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
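For scripting, the same endpoint can be queried from Python. A minimal sketch, assuming only that the endpoint returns JSON (the exact response schema is not documented on this page):

import requests

URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "transformers/codepawl/turboquant-torch")
resp = requests.get(URL, timeout=10)
resp.raise_for_status()
data = resp.json()  # field names are not documented here; inspect the payload
print(data)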