Bruce-Lee-LY/decoding_attention

Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.

Score: 35 / 100 (Emerging)

This project helps AI engineers speed up the decoding stage of large language model (LLM) inference, the phase that generates output tokens one at a time. It accepts the common attention configurations (MHA, MQA, GQA, MLA) and computes them on NVIDIA GPUs, producing faster attention during token generation. It is aimed at AI/ML engineers working with LLMs, especially those deploying models for real-time applications.
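For intuition only, here is a minimal NumPy sketch of what a single decoding step computes: one new query token attends over the cached keys and values, and MHA, MQA, and GQA differ only in how many query heads share each cached KV head (MLA additionally compresses the cache and is not shown). The function name, shapes, and parameters below are illustrative assumptions, not this library's CUDA/C++ API.

# Minimal NumPy reference for decoding-stage attention (illustrative only;
# not the decoding_attention CUDA API -- names and shapes are assumptions).
import numpy as np

def decode_attention(q, k_cache, v_cache):
    # q:       (num_q_heads, head_dim)            query for the new token
    # k_cache: (num_kv_heads, seq_len, head_dim)  cached keys
    # v_cache: (num_kv_heads, seq_len, head_dim)  cached values
    # MHA: num_q_heads == num_kv_heads; MQA: num_kv_heads == 1;
    # GQA: num_q_heads is a multiple of num_kv_heads.
    num_q_heads, head_dim = q.shape
    num_kv_heads = k_cache.shape[0]
    group = num_q_heads // num_kv_heads           # query heads per KV head
    out = np.empty_like(q)
    for h in range(num_q_heads):
        k = k_cache[h // group]                   # (seq_len, head_dim)
        v = v_cache[h // group]
        scores = k @ q[h] / np.sqrt(head_dim)     # (seq_len,)
        scores -= scores.max()                    # numerically stable softmax
        probs = np.exp(scores)
        probs /= probs.sum()
        out[h] = probs @ v                        # weighted sum of cached values
    return out

# Example: GQA with 8 query heads sharing 2 KV heads over a 128-token cache.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64), dtype=np.float32)
k = rng.standard_normal((2, 128, 64), dtype=np.float32)
v = rng.standard_normal((2, 128, 64), dtype=np.float32)
print(decode_attention(q, k, v).shape)  # (8, 64)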

No commits in the last 6 months.

Use this if you are an AI/ML engineer optimizing LLM inference speed during the token-generation (decoding) phase, especially for shorter sequence lengths or attention variants beyond what standard Flash Attention targets.

Not ideal if your primary concern is LLM training performance or if you exclusively work with very long sequence lengths where other optimized libraries like Flash Attention might already excel.

LLM inference · GPU optimization · Deep learning deployment · AI model acceleration · Natural Language Processing
Stale (6m) · No Package · No Dependents
Maintenance: 2 / 25
Adoption: 8 / 25
Maturity: 16 / 25
Community: 9 / 25

Stars: 46
Forks: 4
Language: C++
License: BSD-3-Clause
Last pushed: Jun 11, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Bruce-Lee-LY/decoding_attention"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
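For scripting, the same request can be made from Python; this is a minimal sketch using the requests library against the URL shown above (the response schema is not documented here, so the script just prints the returned JSON; authentication for keyed access is omitted because the header name is not specified).

# Minimal Python equivalent of the curl call above (requests is a third-party
# package: pip install requests). The response fields are not documented here.
import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/Bruce-Lee-LY/decoding_attention"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(resp.json())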