thu-ml/SageAttention
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention, with no loss in end-to-end metrics across language, image, and video models.
SageAttention significantly speeds up inference for large AI models, such as those used for language, image, and video generation. It provides quantized attention kernels that act as a drop-in replacement for standard attention, so your existing model runs 2-5x faster on modern NVIDIA GPUs without retraining and without losing accuracy. It is designed for AI engineers and machine learning practitioners who deploy and run large AI models.
Use this if you need to accelerate inference of your deployed AI models for language, image, or video tasks on NVIDIA GPUs (Ampere, Ada, Hopper) without sacrificing output quality.
Not ideal if you are still in the model development or training phase, or if you primarily work with older GPU architectures or non-NVIDIA hardware.
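The core idea behind the speedup, quantizing Q and K to INT8 so the Q·Kᵀ matmul runs in fast integer arithmetic, can be illustrated with a minimal NumPy sketch. This is a toy illustration of per-row symmetric INT8 quantization, not the library's actual CUDA kernels (which add further tricks such as smoothing and FP8 paths that this sketch omits):

```python
import numpy as np

def quantize_int8(x):
    # Per-row symmetric INT8 quantization: scale each row so its max
    # magnitude maps to 127, then round to int8.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.round(x / scale).astype(np.int8), scale

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_fp32(q, k, v):
    # Reference full-precision attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def attention_int8(q, k, v):
    # Quantize Q and K; the Q @ K^T matmul then runs in integer
    # arithmetic (the fast tensor-core path on real hardware), and the
    # per-row scales are folded back in before the softmax.
    q8, qs = quantize_int8(q)
    k8, ks = quantize_int8(k)
    scores = (q8.astype(np.int32) @ k8.astype(np.int32).T) * (qs * ks.T)
    return softmax(scores / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 64)).astype(np.float32) for _ in range(3))
err = np.abs(attention_fp32(q, k, v) - attention_int8(q, k, v)).max()
print(f"max abs error: {err:.4f}")  # small: quantization barely moves the output
```

The point of the sketch is the one the tagline makes: the softmax output is remarkably insensitive to INT8 rounding of Q and K, which is why the quantized path can be so much faster without measurably changing end-to-end results.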
Stars
3,213
Forks
366
Language
Cuda
License
Apache-2.0
Category
Last pushed
Jan 17, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/thu-ml/SageAttention"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related models
fla-org/flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
thu-ml/SpargeAttn
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
fla-org/flame
🔥 A minimal training framework for scaling FLA models
foundation-model-stack/fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for...
NX-AI/mlstm_kernels
Tiled Flash Linear Attention library for fast and efficient mLSTM Kernels.