taishan1994/LLM-Quantization
A collection of notes and summaries on quantizing LLMs.
This is a collection of summaries and practical examples about quantizing Large Language Models (LLMs). It covers techniques for reducing the compute and memory needed to run these models without significantly degrading accuracy, and will be useful to researchers and machine learning engineers working with LLMs.
Use this if you are a machine learning engineer or researcher looking to shrink large language models and speed up their deployment and inference.
Not ideal if you are looking for an off-the-shelf software tool: this project is a compilation of research and practical notes, not a direct implementation.
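To make the core idea concrete, here is a minimal sketch (illustrative only, not code from this repo) of symmetric absmax int8 quantization, the basic float-to-integer mapping that most LLM quantization schemes build on:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric absmax quantization: map floats into int8 [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

# float32 -> int8 gives a 4x memory reduction at the cost of
# a rounding error bounded by half the quantization step (scale / 2).
w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
max_err = np.abs(w - w_hat).max()
```

Real libraries refine this with per-channel or per-block scales and outlier handling, but the storage/accuracy trade-off is the same.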
Stars: 63
Forks: 8
Language: Python
License: —
Category: —
Last pushed: Jan 08, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/taishan1994/LLM-Quantization"
Higher-rated alternatives
bitsandbytes-foundation/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
intel/neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model...
dropbox/hqq
Official implementation of Half-Quadratic Quantization (HQQ)
OpenGVLab/OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
Hsu1023/DuQuant
[NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger...