Aaronhuang-778/BiLLM
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
BiLLM helps machine learning engineers and researchers reduce the memory and computational demands of large language models (LLMs). It takes existing pretrained LLMs (such as LLaMA or OPT) and compresses their full-precision weights into a 1-bit format after training, with no retraining required. The result is much smaller models that retain most of their accuracy, making them practical to deploy on devices with limited resources.
228 stars. No commits in the last 6 months.
Use this if you need to significantly shrink the size and computational requirements of large language models while maintaining high accuracy, especially for deployment on resource-constrained hardware.
Not ideal if you are not working with large language models or if you require an extremely high-precision model where even minimal accuracy trade-offs are unacceptable.
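To make the 1-bit idea concrete: the simplest form of weight binarization keeps only the sign of each weight plus one shared scaling factor per group (BiLLM's actual method is more elaborate, splitting salient from non-salient weights and using a residual binarization; the sketch below shows only the generic sign-plus-scale scheme, with hypothetical helper names).

```python
def binarize_1bit(weights):
    """Generic 1-bit quantization sketch (not BiLLM's exact algorithm):
    store only the sign of each weight, plus one shared scale alpha,
    chosen as the mean absolute value to minimize reconstruction error."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs

def dequantize(alpha, signs):
    """Reconstruct approximate weights: w_hat = alpha * sign(w)."""
    return [alpha * s for s in signs]

# Each original float weight is replaced by a single sign bit;
# only alpha is kept in full precision for the whole group.
alpha, signs = binarize_1bit([0.5, -1.0, 1.5, -2.0])
print(alpha, signs)                 # 1.25 [1, -1, 1, -1]
print(dequantize(alpha, signs))     # [1.25, -1.25, 1.25, -1.25]
```

In practice the signs are bit-packed (8 weights per byte), which is where the memory savings over 16-bit formats come from.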
Stars: 228
Forks: 17
Language: Python
License: MIT
Category:
Last pushed: Jan 11, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Aaronhuang-778/BiLLM"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
bitsandbytes-foundation/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
intel/neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model...
dropbox/hqq
Official implementation of Half-Quadratic Quantization (HQQ)
OpenGVLab/OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
Hsu1023/DuQuant
[NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger...