Bruce-Lee-LY/flash_attention_inference

Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.

Score: 37 / 100 (Emerging)

When building and deploying large language models, the attention mechanism is central to generating responses. This project provides a C++ implementation of Flash Attention and Flash Attention v2 to speed up that computation during inference. It takes the model's attention queries, keys, and values and returns the attention output much faster, which benefits engineers focused on optimizing LLM deployment.
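
For context, the computation being accelerated is standard scaled dot-product attention: output = softmax(Q * K^T / sqrt(head_dim)) * V. The sketch below is a minimal single-head CPU reference of that formula, assuming simple row-major float matrices; the function name and layout are illustrative and are not this repository's C++ interface.

// Minimal single-head CPU reference of scaled dot-product attention.
// Illustrative only: names, types, and layouts are assumptions, not this repo's API.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // [rows][cols], row-major

Matrix scaled_dot_product_attention(const Matrix& Q,    // [seq_q][head_dim]
                                    const Matrix& K,    // [seq_k][head_dim]
                                    const Matrix& V) {  // [seq_k][head_dim]
    const std::size_t seq_q = Q.size(), seq_k = K.size(), head_dim = Q[0].size();
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));
    Matrix out(seq_q, std::vector<float>(head_dim, 0.0f));

    for (std::size_t i = 0; i < seq_q; ++i) {
        // Scores of query i against every key, then a numerically stable softmax.
        std::vector<float> scores(seq_k);
        float max_score = -INFINITY;
        for (std::size_t j = 0; j < seq_k; ++j) {
            float dot = 0.0f;
            for (std::size_t d = 0; d < head_dim; ++d) dot += Q[i][d] * K[j][d];
            scores[j] = dot * scale;
            max_score = std::max(max_score, scores[j]);
        }
        float sum = 0.0f;
        for (std::size_t j = 0; j < seq_k; ++j) {
            scores[j] = std::exp(scores[j] - max_score);
            sum += scores[j];
        }
        // The attention output for query i is the softmax-weighted sum of values.
        for (std::size_t j = 0; j < seq_k; ++j) {
            const float w = scores[j] / sum;
            for (std::size_t d = 0; d < head_dim; ++d) out[i][d] += w * V[j][d];
        }
    }
    return out;
}

Flash Attention computes this same result without materializing the full score matrix, which is what makes it attractive for long sequences on GPUs.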

No commits in the last 6 months.

Use this if you are a machine learning engineer or MLOps specialist looking to accelerate inference for your large language models, especially when working with attention variants such as Grouped-Query Attention (GQA) or Attention with Linear Biases (ALiBi), both sketched below.
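
For orientation, those two variants change the attention computation in small, well-defined ways: Grouped-Query Attention shares one key/value head across several query heads, and ALiBi adds a per-head linear bias to the attention scores based on query/key distance. The helpers below sketch those two ideas; the names and signatures are assumptions for illustration and not this repository's interface.

// Illustrative helpers for GQA and ALiBi; names and signatures are assumptions.
#include <cstddef>

// Grouped-Query Attention: several query heads share one key/value head.
// Example: with 32 query heads and 8 KV heads, query heads 0-3 use KV head 0.
std::size_t gqa_kv_head(std::size_t query_head,
                        std::size_t num_query_heads,
                        std::size_t num_kv_heads) {
    return query_head / (num_query_heads / num_kv_heads);
}

// ALiBi: add a head-specific linear penalty to the score of each (query, key)
// pair, proportional to how far the key is behind the query (causal setting).
float alibi_bias(float head_slope, std::size_t query_pos, std::size_t key_pos) {
    return -head_slope * (static_cast<float>(query_pos) - static_cast<float>(key_pos));
}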

Not ideal if you are looking for a high-level Python library for LLM development or if you do not have C++ and CUDA development experience.

Tags: LLM deployment, ML inference optimization, GPU acceleration, AI model performance
Badges: Stale (6 months), No Package, No Dependents
Maintenance 0 / 25
Adoption 8 / 25
Maturity 16 / 25
Community 13 / 25


Stars: 43
Forks: 6
Language: C++
License: BSD-3-Clause
Last pushed: Feb 27, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Bruce-Lee-LY/flash_attention_inference"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.