intel/neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime
A model-compression tool for machine learning engineers and AI researchers working with large deep learning models. It reduces the size and computational demands of models such as Large Language Models (LLMs) and Vision-Language Models (VLMs): you provide a trained model, and it produces a more compact, faster-running version ready for deployment on Intel hardware, as well as some AMD, ARM, and Nvidia platforms.
2,597 stars. Used by 1 other package. Actively maintained with 14 commits in the last 30 days. Available on PyPI.
Use this if you need to optimize your large deep learning models for faster inference and reduced memory footprint, especially when deploying on Intel CPUs, GPUs, or Habana Gaudi AI accelerators.
Not ideal if you are not working with deep learning models, or if model compression for deployment on specialized AI hardware is not your goal.
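To make the "low-bit quantization" idea above concrete, here is a minimal sketch of symmetric round-to-nearest INT4 weight quantization in plain NumPy. This is an illustration of the technique, not neural-compressor's own API; real workflows also calibrate on data, handle outliers, and quantize per-channel or per-group.

```python
import numpy as np

def quantize_int4(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric round-to-nearest INT4: map floats to integers in [-8, 7]."""
    # One scale for the whole tensor; libraries usually use finer granularity.
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for inference."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
# Rounding error per weight is bounded by scale / 2.
print("max abs error:", np.abs(w - w_hat).max())
```

Storing 4-bit integers plus a scale instead of 32-bit floats is what shrinks memory footprint roughly 8x; the trade-off is the rounding error shown above.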
Stars
2,597
Forks
298
Language
Python
License
Apache-2.0
Category
Last pushed
Mar 13, 2026
Commits (30d)
14
Dependencies
14
Reverse dependents
1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/intel/neural-compressor"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
Related models
bitsandbytes-foundation/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
dropbox/hqq
Official implementation of Half-Quadratic Quantization (HQQ)
OpenGVLab/OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
Hsu1023/DuQuant
[NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger...
VITA-Group/Q-GaLore
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.