intel/auto-round
🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantization, MXFP4, NVFP4, GGUF, and adaptive schemes.
This tool helps AI engineers and machine learning practitioners reduce the computational resources needed to run large language models (LLMs) and vision-language models (VLMs) without significantly sacrificing their accuracy. You feed it a large model, and it outputs a smaller, optimized version that runs faster and uses less memory. It's aimed at people who deploy and manage AI models in production.
883 stars. Actively maintained with 85 commits in the last 30 days. Available on PyPI.
Use this if you need to deploy large AI models more efficiently, reducing their size and speeding up inference while maintaining high accuracy, especially on diverse hardware.
Not ideal if you are a data scientist primarily focused on model training and experimentation, as this tool is geared towards post-training optimization for deployment.
Stars: 883
Forks: 81
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 13, 2026
Commits (30d): 85
Dependencies: 8
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/intel/auto-round"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000/day.
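The same endpoint can be called from a script. A minimal sketch, assuming only the URL pattern shown in the curl command above; the helper names (`quality_url`, `fetch_quality`) are illustrative, not part of any official client:

```python
# Minimal sketch of querying the pt-edge quality endpoint.
# Only the URL pattern is taken from the curl example on this page;
# the response's JSON field names are not documented here, so the
# result is returned as a plain dict.
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem: str, owner: str, repo: str) -> str:
    """Build the endpoint URL for a given project."""
    return f"{BASE}/{ecosystem}/{owner}/{repo}"

def fetch_quality(ecosystem: str, owner: str, repo: str) -> dict:
    """Fetch and decode the quality report (keyless tier: 100 requests/day)."""
    with urllib.request.urlopen(quality_url(ecosystem, owner, repo)) as resp:
        return json.load(resp)

# Same project as the curl command above:
url = quality_url("transformers", "intel", "auto-round")
```

Calling `fetch_quality("transformers", "intel", "auto-round")` performs the same request as the curl command and returns the decoded JSON.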
Related models
ModelCloud/GPTQModel: LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
pytorch/ao: PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader: Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress: LLM KV cache compression made easy
BlinkDL/RWKV-LM: RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly...