Bruce-Lee-LY/flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
When building and deploying large language models, the attention mechanism is a major cost of generating responses. This project provides a C++ implementation of Flash Attention and Flash Attention v2 to speed up how these models process information during real-time use. It takes the model's attention queries, keys, and values and produces the attention output much faster than a naive implementation, benefiting engineers focused on optimizing LLM deployment.
No commits in the last 6 months.
Use this if you are a machine learning engineer or MLOps specialist looking to accelerate the inference speed of your large language models, especially when dealing with attention variants like Grouped-Query Attention (GQA) or Attention with Linear Biases (ALiBi).
Not ideal if you are looking for a high-level Python library for LLM development, or if you lack C++ and CUDA development experience.
Stars: 43
Forks: 6
Language: C++
License: BSD-3-Clause
Category:
Last pushed: Feb 27, 2025
Commits (30d): 0
Higher-rated alternatives
fla-org/flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
thu-ml/SageAttention
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x...
thu-ml/SpargeAttn
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
fla-org/flame
🔥 A minimal training framework for scaling FLA models
foundation-model-stack/fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for...