Bruce-Lee-LY/flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
When building and deploying large language models, the attention mechanism is a major cost of generating responses. This project provides a C++ implementation of Flash Attention and Flash Attention v2 to speed up how these models process information during real-time use. It takes the model's attention queries, keys, and values and produces the attention output much faster than a naive implementation, benefiting engineers focused on optimizing LLM deployment.
No commits in the last 6 months.
Use this if you are a machine learning engineer or MLOps specialist looking to accelerate the inference speed of your large language models, especially when dealing with attention variants like Grouped-Query Attention (GQA) or Attention with Linear Biases (ALiBi).
Not ideal if you are looking for a high-level Python library for LLM development, or if you lack C++ and CUDA development experience.
Stars: 43
Forks: 6
Language: C++
License: BSD-3-Clause
Category:
Last pushed: Feb 27, 2025
Commits (30d): 0
Higher-rated alternatives
fla-org/flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
thu-ml/SageAttention
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x...
thu-ml/SpargeAttn
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
fla-org/flame
🔥 A minimal training framework for scaling FLA models
foundation-model-stack/fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for...