egaoharu-kensei/flash-attention-triton
Cross-platform FlashAttention-2 Triton implementation for Turing+ GPUs with custom configuration mode
This project helps machine learning engineers and researchers accelerate large language model training by speeding up the attention mechanism. It takes the query, key, and value tensors from your model and computes attention faster and with less memory than a naive implementation. It is aimed at people building deep learning models, especially large transformers, who need to optimize performance on NVIDIA GPUs.
Available on PyPI.
Use this if you are training large language models or other transformer-based models and need to significantly speed up attention on NVIDIA GPUs (Turing or newer architectures).
Not ideal if your models do not use the attention mechanism, or if you do not have a compatible NVIDIA GPU.
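For context, the operation a FlashAttention kernel accelerates is standard scaled dot-product attention, softmax(QK^T/√d)V. The sketch below is a naive NumPy reference of that math, not this library's API (which is Triton/PyTorch-based); a fused kernel produces the same output without ever materializing the full attention matrix.

```python
import numpy as np

def attention(q, k, v):
    """Naive scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    A FlashAttention kernel computes the same result in tiled GPU passes,
    avoiding the (seq_q, seq_k) attention matrix in memory.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (seq_q, seq_k) logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (seq_q, d) output

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(q, k, v)
print(out.shape)  # (4, 8)
```

The memory savings matter because the naive version's intermediate `weights` array grows quadratically with sequence length, which is exactly what makes long-context attention slow.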
Stars
21
Forks
—
Language
Python
License
MIT
Category
Last pushed
Jan 12, 2026
Commits (30d)
0
Dependencies
1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/egaoharu-kensei/flash-attention-triton"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
fla-org/flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
thu-ml/SageAttention
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x...
thu-ml/SpargeAttn
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
fla-org/flame
🔥 A minimal training framework for scaling FLA models
foundation-model-stack/fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for...