Aaronhuang-778/BiLLM
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
BiLLM helps machine learning engineers and researchers reduce the memory and computational demands of large language models (LLMs). It takes existing pretrained LLMs (such as LLaMA or OPT) and compresses their full-precision weights into a 1-bit format after training, with no retraining required. The result is much smaller models that retain most of their accuracy, making them practical to deploy on devices with limited resources.
228 stars. No commits in the last 6 months.
Use this if you need to significantly shrink the size and computational requirements of large language models while maintaining high accuracy, especially for deployment on resource-constrained hardware.
Not ideal if you are not working with large language models or if you require an extremely high-precision model where even minimal accuracy trade-offs are unacceptable.
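To make the 1-bit idea concrete: the simplest form of weight binarization keeps only the sign of each weight plus one shared scaling factor per group (BiLLM's actual method is more elaborate, splitting salient from non-salient weights and using a residual binarization; the sketch below shows only the generic sign-plus-scale scheme, with hypothetical helper names).

```python
def binarize_1bit(weights):
    """Generic 1-bit quantization sketch (not BiLLM's exact algorithm):
    store only the sign of each weight, plus one shared scale alpha,
    chosen as the mean absolute value to minimize reconstruction error."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs

def dequantize(alpha, signs):
    """Reconstruct approximate weights: w_hat = alpha * sign(w)."""
    return [alpha * s for s in signs]

# Each original float weight is replaced by a single sign bit;
# only alpha is kept in full precision for the whole group.
alpha, signs = binarize_1bit([0.5, -1.0, 1.5, -2.0])
print(alpha, signs)                 # 1.25 [1, -1, 1, -1]
print(dequantize(alpha, signs))     # [1.25, -1.25, 1.25, -1.25]
```

In practice the signs are bit-packed (8 weights per byte), which is where the memory savings over 16-bit formats come from.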
Stars: 228
Forks: 17
Language: Python
License: MIT
Category:
Last pushed: Jan 11, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Aaronhuang-778/BiLLM"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
bitsandbytes-foundation/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
intel/neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model...
dropbox/hqq
Official implementation of Half-Quadratic Quantization (HQQ)
OpenGVLab/OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
Hsu1023/DuQuant
[NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger...