NVIDIA/Star-Attention
Efficient LLM Inference over Long Sequences
This project helps large language model (LLM) developers and MLOps engineers speed up response generation, especially over very long input texts. It applies a block-sparse attention mechanism to an existing Transformer-based LLM at inference time, so the same model can serve long-context requests much faster with minimal accuracy loss and no retraining. It is aimed at professionals building and deploying LLMs who need to serve long-context applications efficiently.
392 stars. No commits in the last 6 months.
Use this if you are running Transformer-based LLMs that process very long text inputs and you need to significantly improve their response generation speed without extensive retraining.
Not ideal if your LLM applications primarily deal with short text inputs or if you are not working with Transformer architectures.
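At a high level, this family of techniques splits a long context into blocks, attends to each block independently, and then merges the per-block results using each block's softmax statistics. The snippet below is a minimal NumPy sketch of that merge step (an online-softmax combination); it is an illustration of the general idea only, not code from this repository:

```python
import numpy as np

def block_attention(q, k, v):
    """Softmax attention over a single KV block.

    Also returns the log-sum-exp of the scores so several blocks
    can be merged exactly later.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    lse = np.log(np.sum(np.exp(scores), axis=-1, keepdims=True))
    out = np.exp(scores - lse) @ v
    return out, lse

def merged_attention(q, kv_blocks):
    """Attend to each KV block independently, then combine.

    The per-block outputs are weighted by softmax weights derived
    from each block's log-sum-exp, which reproduces full attention
    over the concatenated blocks exactly.
    """
    outs, lses = zip(*(block_attention(q, k, v) for k, v in kv_blocks))
    lses = np.concatenate(lses, axis=-1)             # (n_queries, n_blocks)
    w = np.exp(lses - lses.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                 # per-block weights
    return sum(w[:, i:i + 1] * outs[i] for i in range(len(outs)))
```

Because the merge is exact, attending to three blocks of 4 tokens and combining them gives the same result as attending to all 12 tokens at once; the speedup in practice comes from running the per-block phase in parallel across devices.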
Stars: 392
Forks: 21
Language: Python
License: Apache-2.0
Category:
Last pushed: Jun 25, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NVIDIA/Star-Attention"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
Higher-rated alternatives
- fla-org/flash-linear-attention: 🚀 Efficient implementations of state-of-the-art linear attention models
- thu-ml/SageAttention: [ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x...
- thu-ml/SpargeAttn: [ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
- fla-org/flame: 🔥 A minimal training framework for scaling FLA models
- foundation-model-stack/fms-fsdp: 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for...