thu-ml/SageAttention

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

Score: 57/100 (Established)

SageAttention significantly speeds up inference for large AI models, such as those used for language, image, and video generation. It works as a drop-in replacement for the attention computation in your existing model: the model itself is unchanged, but attention runs 2-5 times faster on modern NVIDIA GPUs without losing accuracy. It is designed for AI engineers and machine learning practitioners who deploy and run large AI models.


Use this if you need to accelerate inference for your deployed AI models on language, image, or video tasks on NVIDIA GPUs (Ampere, Ada, Hopper) without sacrificing output quality.

Not ideal if you are still in the model development or training phase, or if you primarily work with older GPU architectures or non-NVIDIA hardware.

Tags: AI model deployment, inference optimization, large language models, computer vision, generative AI
No package published · No dependents
Maintenance: 10/25
Adoption: 10/25
Maturity: 16/25
Community: 21/25
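The four category scores above appear to add up to the overall 57/100. A minimal sketch of that check, assuming (this is not documented on the page) that the total is simply the sum of the four categories:

```python
# Subscores as shown on the page; the summation rule is an assumption.
subscores = {"Maintenance": 10, "Adoption": 10, "Maturity": 16, "Community": 21}

total = sum(subscores.values())
print(total)  # 57, matching the 57/100 overall score
```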


Stars: 3,213
Forks: 366
Language: CUDA
License: Apache-2.0
Last pushed: Jan 17, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/thu-ml/SageAttention"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
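The curl example above suggests the endpoint follows a /api/v1/quality/<registry>/<owner>/<repo> URL pattern. A small hypothetical helper for building that URL for other repositories (the pattern is inferred from the single example, not from documented API behavior):

```python
# Assumed endpoint pattern, generalized from the curl example on this page.
BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(registry: str, owner: str, repo: str) -> str:
    """Build the quality-score API URL for a given registry and repo."""
    return f"{BASE}/{registry}/{owner}/{repo}"

print(quality_url("transformers", "thu-ml", "SageAttention"))
# https://pt-edge.onrender.com/api/v1/quality/transformers/thu-ml/SageAttention
```

Fetching the URL with any HTTP client (curl, requests, urllib) should then return the score data as JSON, subject to the 100 requests/day unauthenticated limit.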