megvii-research/IntLLaMA
IntLLaMA: A fast and light quantization solution for LLaMA
IntLLaMA helps AI engineers and researchers reduce the memory footprint and speed up large language models such as LLaMA with little loss in accuracy. It takes a full-precision language model as input and produces a smaller, faster, quantized version, aimed at practitioners who deploy or experiment with large models on resource-constrained hardware.
No commits in the last 6 months.
Use this if you need to run large language models efficiently on GPUs with limited memory.
Not ideal if you need a general-purpose fine-tuning library or work with models other than LLaMA or ChatGLMv2.
Stars: 18
Forks: —
Language: Python
License: Apache-2.0
Category: —
Last pushed: Jul 21, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/megvii-research/IntLLaMA"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
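The same endpoint can be called from Python. A minimal sketch using only the standard library, assuming the endpoint returns a JSON body (the response schema is not documented here):

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    # Build the per-repo quality endpoint URL; path segments are URL-escaped.
    return f"{API_BASE}/{quote(owner)}/{quote(repo)}"

def fetch_quality(owner: str, repo: str) -> dict:
    # Call the public endpoint (100 requests/day without a key) and decode
    # the JSON response. Assumes a JSON body; the schema is not shown above.
    with urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

print(quality_url("megvii-research", "IntLLaMA"))
```

With a free API key (1,000 requests/day), you would attach it per the service's instructions; the authentication header name is not documented here.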
Higher-rated alternatives
ModelCloud/GPTQModel: LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
intel/auto-round: 🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao: PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader: Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress: LLM KV cache compression made easy