MaxLSB/flash-attn2
FlashAttention for sliding window attention in Triton (forward and backward passes)
This project provides a Triton implementation of FlashAttention with support for sliding window attention, covering both the forward and backward passes. By fusing the attention computation into optimized NVIDIA GPU kernels, it significantly speeds up training and inference for large language models that use sliding window, global, or causal attention.
No commits in the last 6 months.
Use this if you are developing or training large language models and need to speed up the attention computation on NVIDIA GPUs, especially for models employing sliding window attention.
Not ideal if you are not working with large language models, lack access to NVIDIA GPUs, or need features this kernel does not provide, such as dropout or attention variants beyond sliding window, global, and causal.
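For reference, here is a minimal PyTorch sketch of the sliding window masking that a fused kernel like this one computes. It is for illustration only and does not reflect this repository's actual API; the function name and signature are invented for the example.

import math
import torch

def sliding_window_attention(q, k, v, window):
    # Naive reference: each query position i attends only to key
    # positions j with i - window < j <= i (a causal sliding window).
    seq_len = q.shape[-2]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, row vector
    mask = (j <= i) & (j > i - window)      # causal band of width `window`
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example: 1 head, 8 tokens, head dimension 16, window of 4
q, k, v = (torch.randn(1, 8, 16) for _ in range(3))
out = sliding_window_attention(q, k, v, window=4)

A fused implementation such as this repository's avoids materializing the full seq_len x seq_len score matrix, which is where the speedup over the naive version above comes from.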
Stars
11
Forks
1
Language
Python
License
MIT
Category
ML Frameworks
Last pushed
Jun 25, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/MaxLSB/flash-attn2"
Open to everyone: 100 requests/day with no key required. Get a free key for 1,000 requests/day.
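A short Python sketch of fetching the same endpoint, assuming the requests package is installed. The response's field names are not documented here, so the script simply prints the raw JSON payload.

import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/MaxLSB/flash-attn2"

# Anonymous tier: 100 requests/day, no key needed.
resp = requests.get(URL, timeout=10)
resp.raise_for_status()
print(resp.json())

# For the keyed 1,000/day tier you would attach your key to the request;
# the header name below is a hypothetical placeholder, not documented here:
# resp = requests.get(URL, headers={"X-Api-Key": "<your-key>"}, timeout=10)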
Higher-rated alternatives
triton-inference-server/server
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
gpu-mode/Triton-Puzzles
Puzzles for learning Triton
hailo-ai/hailo_model_zoo
The Hailo Model Zoo includes pre-trained models and a full building and evaluation environment
open-mmlab/mmdeploy
OpenMMLab Model Deployment Framework
hyperai/tvm-cn
TVM Documentation in Simplified Chinese