xvyaward/owq
Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models".
This project helps machine learning engineers and researchers make large language models (LLMs) like LLaMA and BLOOM more efficient without losing quality. It takes an existing LLM and quantizes its weights to 3 or 4 bits, while preserving crucial 'outlier' columns at higher precision. The output is a smaller, faster LLM that can be fine-tuned and used for inference with significantly reduced memory and computational requirements.
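The core idea — low-bit quantization for most weight columns while a few sensitive "outlier" columns stay at full precision — can be sketched as below. This is an illustrative toy, not the repository's implementation: the function name, the per-column min/max scheme, and the squared-norm sensitivity proxy are assumptions (the OWQ paper selects outlier columns with a Hessian-based sensitivity measure and uses optimized CUDA kernels).

```python
import numpy as np

def owq_quantize_sketch(W, bits=3, n_outliers=8):
    """Toy outlier-aware quantization: round most columns to `bits`-bit
    integers (per-column min/max scaling), keep the most sensitive
    columns in full precision. Illustrative only; not the paper's exact
    algorithm."""
    W = np.asarray(W, dtype=np.float32)
    # Proxy for column sensitivity (the paper uses Hessian information):
    sensitivity = (W ** 2).sum(axis=0)
    outlier_idx = np.argsort(sensitivity)[-n_outliers:]
    levels = 2 ** bits - 1
    W_q = W.copy()
    for j in range(W.shape[1]):
        if j in outlier_idx:
            continue  # outlier columns are stored at higher precision
        lo, hi = W[:, j].min(), W[:, j].max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        q = np.round((W[:, j] - lo) / scale)  # integers in [0, levels]
        W_q[:, j] = q * scale + lo            # dequantize for comparison
    return W_q, outlier_idx
```

In practice the quantized columns would be stored as packed integers plus per-column scale/zero-point, with the outlier columns kept separately in fp16, which is what yields the memory savings.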
No commits in the last 6 months.
Use this if you need to run large language models on hardware with limited memory or computational power, or if you want to speed up inference and fine-tuning of LLMs while maintaining high accuracy.
Not ideal if you are working with smaller models that don't benefit as much from aggressive quantization, or if your hardware is not an NVIDIA A100/A6000/RTX 3090, as kernel performance may be suboptimal on other GPUs.
Stars: 69
Forks: 8
Language: Python
License: —
Category:
Last pushed: Mar 07, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/xvyaward/owq"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
bitsandbytes-foundation/bitsandbytes: Accessible large language models via k-bit quantization for PyTorch.
intel/neural-compressor: SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model...
dropbox/hqq: Official implementation of Half-Quadratic Quantization (HQQ).
OpenGVLab/OmniQuant: [ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
Hsu1023/DuQuant: [NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger...