pytorch/ao
PyTorch native quantization and sparsity for training and inference
This project gives machine learning engineers and researchers PyTorch-native tools to make large language models (LLMs) and other deep learning models run faster and use less memory. Optimizing a model this way speeds up both training and inference, which matters most for powerful but resource-intensive models like Llama. You put in a PyTorch model and get back a more efficient, optimized version ready for faster training or deployment.
2,729 stars. Actively maintained with 121 commits in the last 30 days.
Use this if you need to accelerate model training or inference, reduce memory consumption, and deploy large models more efficiently.
Not ideal if you are working with small models where performance optimization isn't a primary concern or if you need to maintain absolute floating-point precision without any trade-offs.
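The trade-off mentioned above comes from quantization: storing weights in low-precision integers instead of 32-bit floats. As a minimal illustration of the underlying idea (not torchao's actual API), here is a pure-Python sketch of symmetric per-tensor int8 quantization, where each float is mapped to an integer in [-127, 127] via a single scale factor:

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: map floats to [-127, 127]."""
    # One scale for the whole tensor, chosen so the largest value maps to 127.
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 codes."""
    return [x * scale for x in q]

# Toy "weights" standing in for a model tensor.
weights = [0.5, -1.0, 0.25, 0.8]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored value differs from the original by at most half a scale step, which is the precision loss the "Not ideal if..." caveat refers to; torchao applies far more sophisticated variants of this idea (per-channel scales, int4, float8) with hardware-accelerated kernels.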
Stars: 2,729
Forks: 456
Language: Python
License: —
Category:
Last pushed: Mar 13, 2026
Commits (30d): 121
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/pytorch/ao"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related models
ModelCloud/GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
intel/auto-round
🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
bodaay/HuggingFaceModelDownloader
Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress
LLM KV cache compression made easy
BlinkDL/RWKV-LM
RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly...