OpenGVLab/OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
This project helps machine learning practitioners compress large language models (LLMs) such as LLaMA and Falcon. It takes an existing large LLM and outputs a smaller, quantized version that uses less memory and can run on less powerful hardware, including mobile phones. Data scientists and AI/ML engineers who need to deploy LLMs efficiently will find it useful.
Use this if you need to reduce the memory footprint and enable more efficient deployment of large language models on resource-constrained devices or with limited GPU memory.
Not ideal if you are working with smaller, non-LLM models or if memory footprint is not a critical concern for your deployment.
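To make "quantized version" concrete, here is a minimal sketch of round-to-nearest low-bit weight quantization, the general idea behind tools like OmniQuant (OmniQuant itself additionally learns clipping and shifting parameters; the function names here are illustrative, not part of the library's API):

```python
# Minimal sketch of symmetric round-to-nearest quantization.
# Pure-Python for clarity; real implementations operate on tensors.

def quantize(weights, n_bits=4):
    """Quantize floats to signed n-bit integers with a shared scale."""
    qmax = 2 ** (n_bits - 1) - 1              # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 0.07]
q, s = quantize(w)          # 4 bits per value instead of 16/32
w_hat = dequantize(q, s)    # close to w, within about scale/2 per entry
```

Storing 4-bit codes plus one scale instead of 16-bit floats is what shrinks the model's memory footprint; OmniQuant's contribution is keeping accuracy high at these low bit-widths.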
Stars
890
Forks
76
Language
Python
License
MIT
Last pushed
Nov 26, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/OpenGVLab/OmniQuant"
Open to everyone: 100 requests/day with no key; a free key raises the limit to 1,000/day.
Higher-rated alternatives
bitsandbytes-foundation/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
intel/neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model...
dropbox/hqq
Official implementation of Half-Quadratic Quantization (HQQ)
Hsu1023/DuQuant
[NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger...
VITA-Group/Q-GaLore
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.