ModelCloud/GPTQModel
LLM quantization (compression) toolkit with hardware-acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPUs via HF, vLLM, and SGLang.
This toolkit helps you reduce the size of large language models (LLMs) so they can run more efficiently on your existing hardware, from powerful Nvidia GPUs to standard Intel/AMD/Apple CPUs. It takes a pre-trained, uncompressed LLM as input and produces a smaller, quantized version that consumes less memory and often runs faster. Data scientists, machine learning engineers, and researchers who deploy LLMs in resource-constrained environments will find this particularly useful.
1,044 stars. Used by 3 other packages. Actively maintained with 190 commits in the last 30 days. Available on PyPI.
Use this if you need to deploy large language models on devices with limited memory or computational power, or if you want to speed up inference for your LLM applications.
Not ideal if you primarily work with small models that don't benefit significantly from quantization, or if you require peak accuracy and cannot tolerate the minor quality degradation that compression can introduce.
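As a concrete illustration of the workflow described above (load a pre-trained model, calibrate, quantize, save), here is a minimal sketch following the project's documented `GPTQModel.load` / `quantize` / `save` pattern. The model ID, calibration dataset, and output path are illustrative placeholders, and the exact `QuantizeConfig` defaults may differ by version; consult the repo's README for the authoritative API.

```python
# Sketch: 4-bit GPTQ quantization of a small instruct model.
# Model ID, calibration slice, and save path are illustrative.
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# A small calibration corpus; real runs typically use ~1k samples.
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

# 4-bit weights with group size 128 is the common GPTQ setting.
quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
model.quantize(calibration, batch_size=2)   # runs the GPTQ calibration pass
model.save("Llama-3.2-1B-Instruct-gptq-4bit")
```

The saved directory can then be loaded back with `GPTQModel.load(...)` for inference, or served through vLLM/SGLang as the description above notes.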
Stars: 1,044
Forks: 166
Language: Python
License: —
Category: —
Last pushed: Mar 13, 2026
Commits (30d): 190
Dependencies: 21
Reverse dependents: 3
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ModelCloud/GPTQModel"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related models
intel/auto-round: 🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao: PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader: Simple Go utility to download HuggingFace models and datasets
NVIDIA/kvpress: LLM KV cache compression made easy
BlinkDL/RWKV-LM: RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly...