megvii-research/IntLLaMA
IntLLaMA: A fast and light quantization solution for LLaMA
IntLLaMA helps AI engineers and researchers reduce the memory footprint and speed up large language models such as LLaMA with little loss in accuracy. It takes a full-precision language model as input and produces a smaller, faster, quantized version, aimed at practitioners who deploy or experiment with large models on resource-constrained hardware.
No commits in the last 6 months.
Use this if you need to run large language models efficiently on GPUs with limited memory.
Not ideal if you need a general-purpose fine-tuning library or work with models other than LLaMA or ChatGLMv2.
Stars: 18
Forks: —
Language: Python
License: Apache-2.0
Category: —
Last pushed: Jul 21, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/megvii-research/IntLLaMA"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
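The same endpoint can be called from Python. A minimal sketch using only the standard library, assuming the endpoint returns a JSON body (the response schema is not documented here):

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    # Build the per-repo quality endpoint URL; path segments are URL-escaped.
    return f"{API_BASE}/{quote(owner)}/{quote(repo)}"

def fetch_quality(owner: str, repo: str) -> dict:
    # Call the public endpoint (100 requests/day without a key) and decode
    # the JSON response. Assumes a JSON body; the schema is not shown above.
    with urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

print(quality_url("megvii-research", "IntLLaMA"))
```

With a free API key (1,000 requests/day), you would attach it per the service's instructions; the authentication header name is not documented here.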
Higher-rated alternatives
ModelCloud/GPTQModel: LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
intel/auto-round: 🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao: PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader: Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress: LLM KV cache compression made easy