OpenGVLab/OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
This project helps machine learning practitioners compress large language models (LLMs) such as LLaMA and Falcon. It takes an existing large LLM and outputs a smaller, quantized version that uses less memory and can run on less powerful hardware, including mobile phones. Data scientists and AI/ML engineers who need to deploy LLMs efficiently will find it useful.
Use this if you need to reduce the memory footprint and enable more efficient deployment of large language models on resource-constrained devices or with limited GPU memory.
Not ideal if you are working with smaller, non-LLM models or if memory footprint is not a critical concern for your deployment.
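To make "quantized version" concrete, here is a minimal sketch of round-to-nearest low-bit weight quantization, the general idea behind tools like OmniQuant (OmniQuant itself additionally learns clipping and shifting parameters; the function names here are illustrative, not part of the library's API):

```python
# Minimal sketch of symmetric round-to-nearest quantization.
# Pure-Python for clarity; real implementations operate on tensors.

def quantize(weights, n_bits=4):
    """Quantize floats to signed n-bit integers with a shared scale."""
    qmax = 2 ** (n_bits - 1) - 1              # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 0.07]
q, s = quantize(w)          # 4 bits per value instead of 16/32
w_hat = dequantize(q, s)    # close to w, within about scale/2 per entry
```

Storing 4-bit codes plus one scale instead of 16-bit floats is what shrinks the model's memory footprint; OmniQuant's contribution is keeping accuracy high at these low bit-widths.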
Stars
890
Forks
76
Language
Python
License
MIT
Last pushed
Nov 26, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/OpenGVLab/OmniQuant"
Open to everyone: 100 requests/day with no key; a free key raises the limit to 1,000/day.
Higher-rated alternatives
bitsandbytes-foundation/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
intel/neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model...
dropbox/hqq
Official implementation of Half-Quadratic Quantization (HQQ)
Hsu1023/DuQuant
[NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger...
VITA-Group/Q-GaLore
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.