xvyaward/owq
Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models".
This project helps machine learning engineers and researchers make large language models (LLMs) like LLaMA and BLOOM more efficient without losing quality. It takes an existing LLM and quantizes its weights to 3 or 4 bits, while preserving crucial 'outlier' columns at higher precision. The output is a smaller, faster LLM that can be fine-tuned and used for inference with significantly reduced memory and computational requirements.
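The core idea — low-bit quantization for most weight columns while a few sensitive "outlier" columns stay at full precision — can be sketched as below. This is an illustrative toy, not the repository's implementation: the function name, the per-column min/max scheme, and the squared-norm sensitivity proxy are assumptions (the OWQ paper selects outlier columns with a Hessian-based sensitivity measure and uses optimized CUDA kernels).

```python
import numpy as np

def owq_quantize_sketch(W, bits=3, n_outliers=8):
    """Toy outlier-aware quantization: round most columns to `bits`-bit
    integers (per-column min/max scaling), keep the most sensitive
    columns in full precision. Illustrative only; not the paper's exact
    algorithm."""
    W = np.asarray(W, dtype=np.float32)
    # Proxy for column sensitivity (the paper uses Hessian information):
    sensitivity = (W ** 2).sum(axis=0)
    outlier_idx = np.argsort(sensitivity)[-n_outliers:]
    levels = 2 ** bits - 1
    W_q = W.copy()
    for j in range(W.shape[1]):
        if j in outlier_idx:
            continue  # outlier columns are stored at higher precision
        lo, hi = W[:, j].min(), W[:, j].max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        q = np.round((W[:, j] - lo) / scale)  # integers in [0, levels]
        W_q[:, j] = q * scale + lo            # dequantize for comparison
    return W_q, outlier_idx
```

In practice the quantized columns would be stored as packed integers plus per-column scale/zero-point, with the outlier columns kept separately in fp16, which is what yields the memory savings.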
No commits in the last 6 months.
Use this if you need to run large language models on hardware with limited memory or computational power, or if you want to speed up inference and fine-tuning of LLMs while maintaining high accuracy.
Not ideal if you are working with smaller models that don't benefit as much from aggressive quantization, or if your hardware is not an NVIDIA A100/A6000/RTX 3090, as kernel performance may be suboptimal on other GPUs.
Stars: 69
Forks: 8
Language: Python
License: —
Category:
Last pushed: Mar 07, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/xvyaward/owq"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
bitsandbytes-foundation/bitsandbytes: Accessible large language models via k-bit quantization for PyTorch.
intel/neural-compressor: SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model...
dropbox/hqq: Official implementation of Half-Quadratic Quantization (HQQ).
OpenGVLab/OmniQuant: [ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
Hsu1023/DuQuant: [NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger...