SqueezeAILab/SqueezeLLM
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
This project helps machine learning engineers and MLOps specialists deploy large language models (LLMs) more efficiently. It takes existing LLM weights (such as LLaMA, Vicuna, or Mistral) and quantizes them into smaller, optimized checkpoints. The result is an LLM that requires significantly less memory to run, while often maintaining, or even improving, accuracy and speed.
713 stars. No commits in the last 6 months.
Use this if you are struggling to deploy large language models due to high memory requirements on your GPU infrastructure, but want to maintain or improve model performance.
Not ideal if you are working with small models that don't have significant memory footprint issues or if you don't require the absolute best performance metrics for your LLM.
Stars: 713
Forks: 49
Language: Python
License: MIT
Category:
Last pushed: Aug 13, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/SqueezeAILab/SqueezeLLM"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
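The curl command above can also be issued from Python with the standard library. This is a minimal sketch: the URL path comes from the listing, but the shape of the JSON response (field names such as `stars`) is not documented here, so treat any field access as an assumption to verify against a real response.

```python
# Minimal sketch of calling the stats endpoint with only the standard library.
# The endpoint path is taken from the curl example above; the JSON schema of
# the response is an assumption -- inspect a real response before relying on it.
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def stats_url(owner: str, repo: str) -> str:
    """Build the per-repository stats URL used by the API."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_stats(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON stats for one repository."""
    with urllib.request.urlopen(stats_url(owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Prints the same URL the curl example targets.
    print(stats_url("SqueezeAILab", "SqueezeLLM"))
```

Without an API key this counts against the shared 100-requests/day limit, so cache responses rather than polling in a loop.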
Higher-rated alternatives
ModelCloud/GPTQModel
LLM quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD...
intel/auto-round
🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao
PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader
Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress
LLM KV cache compression made easy