ModelTC/QLLM
[ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models"
QLLM helps machine learning engineers and researchers make large language models (LLMs) run more efficiently without losing accuracy. It takes an existing, unquantized LLM and converts it into a smaller, faster version, making it more practical for deployment and use on less powerful hardware. This tool is for those who deploy or research state-of-the-art language models.
No commits in the last 6 months.
Use this if you need to reduce the computational resources (memory, processing power) required to run large language models while maintaining their performance.
Not ideal if you are a casual user of LLMs or don't have experience with model optimization and deployment.
Stars: 39
Forks: 5
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 11, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ModelTC/QLLM"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
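The endpoint above follows a predictable path pattern, so the URL for any repository can be built programmatically. This is a minimal sketch assuming the `/api/v1/quality/<ecosystem>/<owner>/<repo>` layout inferred from the single curl example; `quality_url` is a hypothetical helper, and whether path segments other than `transformers` are valid is an assumption.

```python
from urllib.parse import quote

# Base path taken verbatim from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem: str, owner: str, repo: str) -> str:
    """Build the quality-API URL for a repository (hypothetical helper).

    Each segment is percent-encoded so unusual repo names stay valid in a URL.
    """
    return f"{BASE}/{quote(ecosystem)}/{quote(owner)}/{quote(repo)}"

print(quality_url("transformers", "ModelTC", "QLLM"))
# → https://pt-edge.onrender.com/api/v1/quality/transformers/ModelTC/QLLM
```

The result can be passed to any HTTP client (e.g. `urllib.request.urlopen`); keep the 100 requests/day keyless limit in mind when polling.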
Higher-rated alternatives
- ModelCloud/GPTQModel: LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
- intel/auto-round: 🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
- pytorch/ao: PyTorch native quantization and sparsity for training and inference
- bodaay/HuggingFaceModelDownloader: Simple go utility to download HuggingFace Models and Datasets
- NVIDIA/kvpress: LLM KV cache compression made easy