ModelCloud/GPTQModel

LLM quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPUs via Hugging Face Transformers, vLLM, and SGLang.

Quality score: 83 / 100 (Verified)

This toolkit helps you reduce the size of large language models (LLMs) so they can run more efficiently on your existing hardware, from powerful Nvidia GPUs to standard Intel/AMD/Apple CPUs. It takes a pre-trained, uncompressed LLM as input and produces a smaller, quantized version that consumes less memory and often runs faster. Data scientists, machine learning engineers, and researchers who deploy LLMs in resource-constrained environments will find this particularly useful.
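In practice, quantizing a model is a short script. A minimal sketch, assuming the GPTQModel Python API shown in the project's README (the model ID and calibration text below are placeholders, and exact signatures may vary by version):

from gptqmodel import GPTQModel, QuantizeConfig

# Placeholder inputs: use a real model ID and a few hundred
# representative text samples for calibration.
model_id = "meta-llama/Llama-3.2-1B"
calibration = ["Sample text used to calibrate quantization scales."] * 256

config = QuantizeConfig(bits=4, group_size=128)  # 4-bit weights, groups of 128 columns
model = GPTQModel.load(model_id, config)         # load the full-precision model
model.quantize(calibration)                      # run GPTQ over the calibration set
model.save("Llama-3.2-1B-gptq-4bit")             # write the compressed checkpoint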

1,044 stars. Used by 3 other packages. Actively maintained with 190 commits in the last 30 days. Available on PyPI.

Use this if you need to deploy large language models on devices with limited memory or computational power, or if you want to speed up inference for your LLM applications.

Not ideal if you primarily work with small models that gain little from quantization, or if you require peak accuracy and cannot tolerate the small quality loss that compression can introduce.
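For the deployment use case above, a quantized checkpoint loads and serves much like the original model. A sketch, again assuming the README's API and reusing the placeholder path from the quantization example:

from gptqmodel import GPTQModel

# Load the 4-bit checkpoint saved earlier and run a quick generation.
model = GPTQModel.load("Llama-3.2-1B-gptq-4bit")
tokens = model.generate("The fastest way to shrink an LLM is")[0]
print(model.tokenizer.decode(tokens))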

large-language-models model-deployment edge-ai ml-optimization resource-management
Maintenance: 22 / 25
Adoption: 13 / 25
Maturity: 25 / 25
Community: 23 / 25


Stars: 1,044
Forks: 166
Language: Python
License: (not listed)
Last pushed: Mar 13, 2026
Commits (30d): 190
Dependencies: 21
Reverse dependents: 3

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ModelCloud/GPTQModel"

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
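The same endpoint works from any HTTP client. A minimal Python sketch using the requests library; the response schema isn't documented on this page, so it simply fetches and prints the JSON:

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/ModelCloud/GPTQModel"
resp = requests.get(url, timeout=10)  # unauthenticated call, counts against the 100/day limit
resp.raise_for_status()               # surface rate-limit or server errors
print(resp.json())                    # inspect the returned fields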