SqueezeAILab/SqueezeLLM
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
This project helps machine learning engineers and MLOps specialists deploy large language models (LLMs) more efficiently. It takes existing LLM weights (such as LLaMA, Vicuna, or Mistral) and quantizes them into smaller, optimized checkpoints. The result is an LLM that requires significantly less memory to run, while often maintaining, or even improving, accuracy and speed.
713 stars. No commits in the last 6 months.
Use this if you are struggling to deploy large language models due to high memory requirements on your GPU infrastructure, but want to maintain or improve model performance.
Not ideal if you are working with small models that don't have significant memory footprint issues or if you don't require the absolute best performance metrics for your LLM.
Stars: 713
Forks: 49
Language: Python
License: MIT
Category:
Last pushed: Aug 13, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/SqueezeAILab/SqueezeLLM"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
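The curl command above can also be issued from Python with the standard library. This is a minimal sketch: the URL path comes from the listing, but the shape of the JSON response (field names such as `stars`) is not documented here, so treat any field access as an assumption to verify against a real response.

```python
# Minimal sketch of calling the stats endpoint with only the standard library.
# The endpoint path is taken from the curl example above; the JSON schema of
# the response is an assumption -- inspect a real response before relying on it.
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def stats_url(owner: str, repo: str) -> str:
    """Build the per-repository stats URL used by the API."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_stats(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON stats for one repository."""
    with urllib.request.urlopen(stats_url(owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Prints the same URL the curl example targets.
    print(stats_url("SqueezeAILab", "SqueezeLLM"))
```

Without an API key this counts against the shared 100-requests/day limit, so cache responses rather than polling in a loop.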
Higher-rated alternatives
ModelCloud/GPTQModel
LLM quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD...
intel/auto-round
🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao
PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader
Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress
LLM KV cache compression made easy