ModelCloud/GPTQModel
LLM quantization (compression) toolkit with hardware-acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPUs via HF, vLLM, and SGLang.
This toolkit helps you reduce the size of large language models (LLMs) so they can run more efficiently on your existing hardware, from powerful Nvidia GPUs to standard Intel/AMD/Apple CPUs. It takes a pre-trained, uncompressed LLM as input and produces a smaller, quantized version that consumes less memory and often runs faster. Data scientists, machine learning engineers, and researchers who deploy LLMs in resource-constrained environments will find this particularly useful.
1,044 stars. Used by 3 other packages. Actively maintained with 190 commits in the last 30 days. Available on PyPI.
Use this if you need to deploy large language models on devices with limited memory or computational power, or if you want to speed up inference for your LLM applications.
Not ideal if you primarily work with small models that don't benefit significantly from quantization, or if you require peak accuracy and cannot tolerate the minor quality degradation that compression can introduce.
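As a concrete illustration of the workflow described above (load a pre-trained model, calibrate, quantize, save), here is a minimal sketch following the project's documented `GPTQModel.load` / `quantize` / `save` pattern. The model ID, calibration dataset, and output path are illustrative placeholders, and the exact `QuantizeConfig` defaults may differ by version; consult the repo's README for the authoritative API.

```python
# Sketch: 4-bit GPTQ quantization of a small instruct model.
# Model ID, calibration slice, and save path are illustrative.
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# A small calibration corpus; real runs typically use ~1k samples.
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

# 4-bit weights with group size 128 is the common GPTQ setting.
quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
model.quantize(calibration, batch_size=2)   # runs the GPTQ calibration pass
model.save("Llama-3.2-1B-Instruct-gptq-4bit")
```

The saved directory can then be loaded back with `GPTQModel.load(...)` for inference, or served through vLLM/SGLang as the description above notes.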
Stars: 1,044
Forks: 166
Language: Python
License: —
Category: —
Last pushed: Mar 13, 2026
Commits (30d): 190
Dependencies: 21
Reverse dependents: 3
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ModelCloud/GPTQModel"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related models
intel/auto-round: 🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao: PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader: Simple Go utility to download HuggingFace models and datasets
NVIDIA/kvpress: LLM KV cache compression made easy
BlinkDL/RWKV-LM: RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly...