intel/auto-round
🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantization, MXFP4, NVFP4, GGUF, and adaptive schemes.
This tool helps AI engineers and machine learning practitioners reduce the computational resources needed to run large language models (LLMs) and vision-language models (VLMs) without significantly sacrificing their accuracy. You feed it a large model, and it outputs a smaller, optimized version that runs faster and uses less memory. It's aimed at people who deploy and manage AI models in production.
883 stars. Actively maintained with 85 commits in the last 30 days. Available on PyPI.
Use this if you need to deploy large AI models more efficiently, reducing their size and speeding up inference while maintaining high accuracy, especially on diverse hardware.
Not ideal if you are a data scientist primarily focused on model training and experimentation, as this tool is geared towards post-training optimization for deployment.
Stars: 883
Forks: 81
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 13, 2026
Commits (30d): 85
Dependencies: 8
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/intel/auto-round"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000/day.
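The same endpoint can be called from a script. A minimal sketch, assuming only the URL pattern shown in the curl command above; the helper names (`quality_url`, `fetch_quality`) are illustrative, not part of any official client:

```python
# Minimal sketch of querying the pt-edge quality endpoint.
# Only the URL pattern is taken from the curl example on this page;
# the response's JSON field names are not documented here, so the
# result is returned as a plain dict.
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem: str, owner: str, repo: str) -> str:
    """Build the endpoint URL for a given project."""
    return f"{BASE}/{ecosystem}/{owner}/{repo}"

def fetch_quality(ecosystem: str, owner: str, repo: str) -> dict:
    """Fetch and decode the quality report (keyless tier: 100 requests/day)."""
    with urllib.request.urlopen(quality_url(ecosystem, owner, repo)) as resp:
        return json.load(resp)

# Same project as the curl command above:
url = quality_url("transformers", "intel", "auto-round")
```

Calling `fetch_quality("transformers", "intel", "auto-round")` performs the same request as the curl command and returns the decoded JSON.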
Related models
ModelCloud/GPTQModel: LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
pytorch/ao: PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader: Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress: LLM KV cache compression made easy
BlinkDL/RWKV-LM: RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly...