Bruce-Lee-LY/decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.
This project helps AI engineers speed up the decoding stage of large language model (LLM) inference, which is crucial for generating responses efficiently. It supports multiple attention configurations (MHA, MQA, GQA, MLA) and runs them on NVIDIA GPUs to accelerate attention computation. It targets AI/ML engineers working with LLMs, especially those deploying models for real-time applications.
No commits in the last 6 months.
Use this if you are an AI/ML engineer looking to optimize the inference speed of your LLMs, particularly during the token-generation (decoding) phase, and especially for shorter sequence lengths or attention variants beyond standard Flash Attention.
Not ideal if your primary concern is LLM training performance, or if you work exclusively with very long sequence lengths, where optimized libraries like Flash Attention already excel.
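To make the distinction concrete, here is a minimal, hypothetical sketch (not this repo's implementation) of what one decoding step computes: a single new query token attends over the cached keys and values. The shapes follow GQA, where several query heads share each KV head; it reduces to MHA when the head counts are equal and to MQA when there is a single KV head.

```python
import numpy as np

def decode_attention(q, k_cache, v_cache):
    """One decode step of grouped-query attention (illustrative sketch).

    q:       (num_q_heads, head_dim)          -- the single new token's queries
    k_cache: (num_kv_heads, seq_len, head_dim) -- cached keys
    v_cache: (num_kv_heads, seq_len, head_dim) -- cached values
    """
    num_q_heads, head_dim = q.shape
    num_kv_heads = k_cache.shape[0]
    group = num_q_heads // num_kv_heads  # query heads per shared KV head
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group  # map this query head to its shared KV head
        scores = k_cache[kv] @ q[h] / np.sqrt(head_dim)  # (seq_len,)
        weights = np.exp(scores - scores.max())          # stable softmax
        weights /= weights.sum()
        out[h] = weights @ v_cache[kv]                   # (head_dim,)
    return out
```

Because the query is a single token, the work per head is a matrix-vector product over the KV cache, which is why decode-stage kernels are tuned differently from prefill/training kernels.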
Stars: 46
Forks: 4
Language: C++
License: BSD-3-Clause
Category:
Last pushed: Jun 11, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Bruce-Lee-LY/decoding_attention"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
fla-org/flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
thu-ml/SageAttention
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x...
thu-ml/SpargeAttn
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
fla-org/flame
🔥 A minimal training framework for scaling FLA models
foundation-model-stack/fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for...