thu-ml/SageAttention
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention, with no loss in end-to-end metrics across language, image, and video models.
SageAttention significantly speeds up inference for large AI models, such as those used for language, image, and video generation. It provides quantized attention kernels that act as a drop-in replacement for standard attention, so your existing model runs 2-5x faster on modern NVIDIA GPUs without retraining and without losing accuracy. It is designed for AI engineers and machine learning practitioners who deploy and run large AI models.
Use this if you need to accelerate inference of your deployed AI models for language, image, or video tasks on NVIDIA GPUs (Ampere, Ada, Hopper) without sacrificing output quality.
Not ideal if you are still in the model development or training phase, or if you primarily work with older GPU architectures or non-NVIDIA hardware.
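The core idea behind the speedup, quantizing Q and K to INT8 so the Q·Kᵀ matmul runs in fast integer arithmetic, can be illustrated with a minimal NumPy sketch. This is a toy illustration of per-row symmetric INT8 quantization, not the library's actual CUDA kernels (which add further tricks such as smoothing and FP8 paths that this sketch omits):

```python
import numpy as np

def quantize_int8(x):
    # Per-row symmetric INT8 quantization: scale each row so its max
    # magnitude maps to 127, then round to int8.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.round(x / scale).astype(np.int8), scale

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_fp32(q, k, v):
    # Reference full-precision attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def attention_int8(q, k, v):
    # Quantize Q and K; the Q @ K^T matmul then runs in integer
    # arithmetic (the fast tensor-core path on real hardware), and the
    # per-row scales are folded back in before the softmax.
    q8, qs = quantize_int8(q)
    k8, ks = quantize_int8(k)
    scores = (q8.astype(np.int32) @ k8.astype(np.int32).T) * (qs * ks.T)
    return softmax(scores / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 64)).astype(np.float32) for _ in range(3))
err = np.abs(attention_fp32(q, k, v) - attention_int8(q, k, v)).max()
print(f"max abs error: {err:.4f}")  # small: quantization barely moves the output
```

The point of the sketch is the one the tagline makes: the softmax output is remarkably insensitive to INT8 rounding of Q and K, which is why the quantized path can be so much faster without measurably changing end-to-end results.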
Stars
3,213
Forks
366
Language
Cuda
License
Apache-2.0
Category
Last pushed
Jan 17, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/thu-ml/SageAttention"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related models
fla-org/flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
thu-ml/SpargeAttn
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
fla-org/flame
🔥 A minimal training framework for scaling FLA models
foundation-model-stack/fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for...
NX-AI/mlstm_kernels
Tiled Flash Linear Attention library for fast and efficient mLSTM Kernels.