Bruce-Lee-LY/decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.
This project helps AI engineers speed up the decoding stage of large language model (LLM) inference, which is crucial for generating responses efficiently. It supports multiple attention configurations (MHA, MQA, GQA, MLA) and runs them on NVIDIA GPUs to accelerate attention computation. It targets AI/ML engineers working with LLMs, especially those deploying models for real-time applications.
No commits in the last 6 months.
Use this if you are an AI/ML engineer looking to optimize the inference speed of your LLMs, particularly during the token-generation (decoding) phase, and especially for shorter sequence lengths or attention variants beyond standard Flash Attention.
Not ideal if your primary concern is LLM training performance, or if you work exclusively with very long sequence lengths, where optimized libraries like Flash Attention already excel.
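To make the distinction concrete, here is a minimal, hypothetical sketch (not this repo's implementation) of what one decoding step computes: a single new query token attends over the cached keys and values. The shapes follow GQA, where several query heads share each KV head; it reduces to MHA when the head counts are equal and to MQA when there is a single KV head.

```python
import numpy as np

def decode_attention(q, k_cache, v_cache):
    """One decode step of grouped-query attention (illustrative sketch).

    q:       (num_q_heads, head_dim)          -- the single new token's queries
    k_cache: (num_kv_heads, seq_len, head_dim) -- cached keys
    v_cache: (num_kv_heads, seq_len, head_dim) -- cached values
    """
    num_q_heads, head_dim = q.shape
    num_kv_heads = k_cache.shape[0]
    group = num_q_heads // num_kv_heads  # query heads per shared KV head
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group  # map this query head to its shared KV head
        scores = k_cache[kv] @ q[h] / np.sqrt(head_dim)  # (seq_len,)
        weights = np.exp(scores - scores.max())          # stable softmax
        weights /= weights.sum()
        out[h] = weights @ v_cache[kv]                   # (head_dim,)
    return out
```

Because the query is a single token, the work per head is a matrix-vector product over the KV cache, which is why decode-stage kernels are tuned differently from prefill/training kernels.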
Stars: 46
Forks: 4
Language: C++
License: BSD-3-Clause
Category:
Last pushed: Jun 11, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Bruce-Lee-LY/decoding_attention"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
fla-org/flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
thu-ml/SageAttention
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x...
thu-ml/SpargeAttn
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
fla-org/flame
🔥 A minimal training framework for scaling FLA models
foundation-model-stack/fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for...