Relaxed-System-Lab/Flash-Sparse-Attention

🚀🚀 Efficient implementations of Native Sparse Attention

/ 100

Emerging

This project provides efficient implementations of Native Sparse Attention for training and running large language models (LLMs). It accepts standard LLM inputs and replaces dense attention with a sparse attention mechanism, which speeds up computation and reduces memory use. It is aimed at developers and AI engineers who train or deploy LLMs that rely on sparse attention.
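To make the idea of sparse attention concrete, here is a toy sketch in plain Python. It is not the project's API or its GPU kernels. The function names (`attend`, `block_sparse_attend`) and the block-ranking rule (dot product of the query with each block's mean key) are assumptions chosen for illustration only. The point it shows is that a query which attends to a few selected key blocks touches far fewer keys than dense attention does.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Dense scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def block_sparse_attend(q, keys, values, block_size, top_k):
    """Attend only to the top_k key blocks (hypothetical selection rule).

    Blocks are ranked by the dot product between the query and the mean
    key of each block. Returns the output and the key indices used.
    """
    blocks = [range(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]

    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(q))]
        return dot(q, mean_key)

    ranked = sorted(range(len(blocks)),
                    key=lambda b: block_score(blocks[b]), reverse=True)
    chosen = sorted(ranked[:top_k])  # keep the original key order
    idx = [i for b in chosen for i in blocks[b]]
    out = attend(q, [keys[i] for i in idx], [values[i] for i in idx])
    return out, idx

# Eight keys in two blocks of four; the query aligns with the second block.
q = [1.0, 0.0]
keys = [[0.0, 1.0]] * 4 + [[1.0, 0.0]] * 4
values = [[float(i)] for i in range(8)]
out, used = block_sparse_attend(q, keys, values, block_size=4, top_k=1)
print(used)  # indices of the keys that were actually attended to
```

With `top_k` equal to the number of blocks, the sketch reduces to dense attention, so the saving comes entirely from how many blocks are skipped.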

983 stars. No commits in the last 6 months.

Use this if you are a machine learning engineer or researcher looking to speed up the training and inference of large language models, particularly those using sparse attention mechanisms on NVIDIA GPUs.

Not ideal if you are working with non-LLM models, do not require sparse attention, or are not using NVIDIA GPUs.

Large-Language-Models Deep-Learning-Optimization AI-Infrastructure Model-Training GPU-Computing

Stale 6m, No Package, No Dependents

Maintenance 2 / 25

Adoption 10 / 25

Maturity 15 / 25

Community 9 / 25

Stars

983

Forks

Language

Python

License

Apache-2.0

Higher-rated alternatives

fla-org/flash-linear-attention

🚀 Efficient implementations of state-of-the-art linear attention models

thu-ml/SageAttention

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x...

thu-ml/SpargeAttn

[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.

fla-org/flame

🔥 A minimal training framework for scaling FLA models

foundation-model-stack/fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for...
