egaoharu-kensei/flash-attention-triton
Cross-platform FlashAttention-2 Triton implementation for Turing+ GPUs with custom configuration mode
This project helps machine learning engineers and researchers accelerate large language model training by speeding up the attention mechanism. It takes the query, key, and value tensors from your model and computes attention faster and with less memory than a naive implementation. It is aimed at people building deep learning models, especially large transformers, who need to optimize performance on NVIDIA GPUs.
Available on PyPI.
Use this if you are training large language models or other transformer-based models and need to significantly speed up attention on NVIDIA GPUs (Turing or newer architectures).
Not ideal if your models do not use the attention mechanism, or if you do not have a compatible NVIDIA GPU.
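For context, the operation a FlashAttention kernel accelerates is standard scaled dot-product attention, softmax(QK^T/√d)V. The sketch below is a naive NumPy reference of that math, not this library's API (which is Triton/PyTorch-based); a fused kernel produces the same output without ever materializing the full attention matrix.

```python
import numpy as np

def attention(q, k, v):
    """Naive scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    A FlashAttention kernel computes the same result in tiled GPU passes,
    avoiding the (seq_q, seq_k) attention matrix in memory.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (seq_q, seq_k) logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (seq_q, d) output

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(q, k, v)
print(out.shape)  # (4, 8)
```

The memory savings matter because the naive version's intermediate `weights` array grows quadratically with sequence length, which is exactly what makes long-context attention slow.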
Stars
21
Forks
—
Language
Python
License
MIT
Category
Last pushed
Jan 12, 2026
Commits (30d)
0
Dependencies
1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/egaoharu-kensei/flash-attention-triton"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
fla-org/flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
thu-ml/SageAttention
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x...
thu-ml/SpargeAttn
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
fla-org/flame
🔥 A minimal training framework for scaling FLA models
foundation-model-stack/fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for...