NoakLiu/LLMEasyQuant
A Serving System for Distributed and Parallel LLM Quantization [Efficient ML System]
This project helps machine learning engineers and researchers make large language models (LLMs) run faster and use less memory with minimal loss of accuracy. It takes an existing LLM, applies quantization and related compression techniques, and outputs a more efficient model ready for deployment. Its primary users are people deploying LLMs to production or running research in environments where computational resources are constrained.
No commits in the last 6 months.
Use this if you need to optimize the performance and reduce the memory footprint of your Large Language Models for efficient deployment or research.
Not ideal if you are looking for a no-code solution or primarily work with traditional machine learning models outside of the LLM space.
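To make the core idea concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantization, the kind of compression the description refers to. This is a generic illustration, not LLMEasyQuant's actual API; the function names are invented for this example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: q = round(w / scale),
    where scale maps the largest absolute weight to 127."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard against all-zero tensors
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 codes."""
    return [v * scale for v in q]
```

Real quantizers add per-channel scales, zero points for asymmetric ranges, and calibration over activation statistics, but the round-trip above already shows why memory drops roughly 4x versus FP32 at a small accuracy cost.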
Stars: 26
Forks: 1
Language: Python
License: —
Category: —
Last pushed: Jun 18, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NoakLiu/LLMEasyQuant"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
bitsandbytes-foundation/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
intel/neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model...
dropbox/hqq
Official implementation of Half-Quadratic Quantization (HQQ)
OpenGVLab/OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
Hsu1023/DuQuant
[NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger...