intel/neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime
A model-compression tool for machine learning engineers and AI researchers working with large deep learning models. It reduces the size and computational demands of models such as Large Language Models (LLMs) and Vision-Language Models (VLMs): you provide a trained model, and it produces a more compact, faster-running version ready for deployment on Intel hardware, as well as some AMD, ARM, and Nvidia platforms.
2,597 stars. Used by 1 other package. Actively maintained with 14 commits in the last 30 days. Available on PyPI.
Use this if you need to optimize your large deep learning models for faster inference and reduced memory footprint, especially when deploying on Intel CPUs, GPUs, or Habana Gaudi AI accelerators.
Not ideal if you are not working with deep learning models, or if model compression for deployment on specialized AI hardware is not your goal.
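To make the "low-bit quantization" idea above concrete, here is a minimal sketch of symmetric round-to-nearest INT4 weight quantization in plain NumPy. This is an illustration of the technique, not neural-compressor's own API; real workflows also calibrate on data, handle outliers, and quantize per-channel or per-group.

```python
import numpy as np

def quantize_int4(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric round-to-nearest INT4: map floats to integers in [-8, 7]."""
    # One scale for the whole tensor; libraries usually use finer granularity.
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for inference."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
# Rounding error per weight is bounded by scale / 2.
print("max abs error:", np.abs(w - w_hat).max())
```

Storing 4-bit integers plus a scale instead of 32-bit floats is what shrinks memory footprint roughly 8x; the trade-off is the rounding error shown above.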
Stars
2,597
Forks
298
Language
Python
License
Apache-2.0
Category
Last pushed
Mar 13, 2026
Commits (30d)
14
Dependencies
14
Reverse dependents
1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/intel/neural-compressor"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
Related models
bitsandbytes-foundation/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
dropbox/hqq
Official implementation of Half-Quadratic Quantization (HQQ)
OpenGVLab/OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
Hsu1023/DuQuant
[NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger...
VITA-Group/Q-GaLore
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.