huawei-csl/SINQ
Welcome to the official repository of SINQ, a novel, fast, high-quality quantization method designed to make any large language model smaller while preserving accuracy.
This project helps machine learning engineers and MLOps professionals deploy large language models (LLMs) more efficiently. It reduces an existing LLM's memory footprint with minimal loss of accuracy, letting you run very large models on GPUs with limited memory. The output is a smaller, high-performing LLM ready for inference.
602 stars. Available on PyPI.
Use this if you need to run large language models on GPUs with limited memory or want to significantly speed up the quantization process for deployment.
Not ideal if you are working with smaller models where memory is not a constraint or if you require end-to-end training during the quantization process.
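To make the memory-vs-accuracy trade-off concrete, here is a minimal round-to-nearest quantization sketch in plain Python. This is NOT SINQ's actual algorithm (SINQ's method is more sophisticated; see the repository for details); it only illustrates the general idea of mapping floating-point weights to low-bit integers plus a scale, which is what shrinks the memory footprint.

```python
# Conceptual round-to-nearest quantization sketch (not SINQ's method).
# Storing 4-bit integers plus one scale per group uses roughly a quarter
# of the memory of fp16 weights.

def quantize_rtn(weights, bits=4):
    """Quantize a list of floats to signed ints sharing one scale."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    # Round to nearest integer and clip to the representable range.
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from ints and the scale."""
    return [v * scale for v in q]

if __name__ == "__main__":
    w = [0.12, -0.5, 0.33, 0.07, -0.91]
    q, s = quantize_rtn(w)
    print("ints:", q, "scale:", round(s, 4))
    print("reconstructed:", [round(v, 3) for v in dequantize(q, s)])
```

The reconstruction error per weight is bounded by half the scale, which is why naive quantization degrades as the weight range grows; methods like SINQ aim to keep accuracy high where this simple scheme would not.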
Stars: 602
Forks: 50
Language: Python
License: Apache-2.0
Category:
Last pushed: Feb 23, 2026
Commits (30d): 0
Dependencies: 14
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/huawei-csl/SINQ"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
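The same endpoint can be called from Python. A minimal sketch, assuming the endpoint returns a JSON body (the response schema is not documented above, so `fetch_stats` simply returns the parsed payload rather than guessing field names):

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def tool_url(owner, repo):
    """Build the per-tool API URL from a GitHub owner/repo pair."""
    return f"{BASE}/{owner}/{repo}"

def fetch_stats(owner, repo, timeout=10):
    """Fetch and parse the JSON stats for one tool.

    Assumes a JSON response; rate limits (100 requests/day without a
    key) apply as described above.
    """
    with urllib.request.urlopen(tool_url(owner, repo), timeout=timeout) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(tool_url("huawei-csl", "SINQ"))
```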
Related tools
SILX-LABS/QUASAR-SUBNET
QUASAR is a long-context foundation model and decentralized evaluation subnet built on Bittensor,
stackblogger/bitnet.js
BitNet.Js - A Node.js implementation of the Microsoft bitnet.cpp inference framework.
m96-chan/0xBitNet
Run BitNet b1.58 ternary LLMs with WebGPU — in browsers and native apps
AnswerDotAI/cold-compress
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking...
FMInference/H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.