NVIDIA/Star-Attention
Efficient LLM Inference over Long Sequences
This project helps large language model (LLM) developers and MLOps engineers speed up response generation, especially over very long input texts. It applies a block-sparse attention mechanism to an existing Transformer-based LLM at inference time, so the same model can serve long-context requests much faster with minimal accuracy loss and no retraining. It is aimed at professionals building and deploying LLMs who need to serve long-context applications efficiently.
392 stars. No commits in the last 6 months.
Use this if you are running Transformer-based LLMs that process very long text inputs and you need to significantly improve their response generation speed without extensive retraining.
Not ideal if your LLM applications primarily deal with short text inputs or if you are not working with Transformer architectures.
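At a high level, this family of techniques splits a long context into blocks, attends to each block independently, and then merges the per-block results using each block's softmax statistics. The snippet below is a minimal NumPy sketch of that merge step (an online-softmax combination); it is an illustration of the general idea only, not code from this repository:

```python
import numpy as np

def block_attention(q, k, v):
    """Softmax attention over a single KV block.

    Also returns the log-sum-exp of the scores so several blocks
    can be merged exactly later.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    lse = np.log(np.sum(np.exp(scores), axis=-1, keepdims=True))
    out = np.exp(scores - lse) @ v
    return out, lse

def merged_attention(q, kv_blocks):
    """Attend to each KV block independently, then combine.

    The per-block outputs are weighted by softmax weights derived
    from each block's log-sum-exp, which reproduces full attention
    over the concatenated blocks exactly.
    """
    outs, lses = zip(*(block_attention(q, k, v) for k, v in kv_blocks))
    lses = np.concatenate(lses, axis=-1)             # (n_queries, n_blocks)
    w = np.exp(lses - lses.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                 # per-block weights
    return sum(w[:, i:i + 1] * outs[i] for i in range(len(outs)))
```

Because the merge is exact, attending to three blocks of 4 tokens and combining them gives the same result as attending to all 12 tokens at once; the speedup in practice comes from running the per-block phase in parallel across devices.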
Stars: 392
Forks: 21
Language: Python
License: Apache-2.0
Category:
Last pushed: Jun 25, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NVIDIA/Star-Attention"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
Higher-rated alternatives
- fla-org/flash-linear-attention: 🚀 Efficient implementations of state-of-the-art linear attention models
- thu-ml/SageAttention: [ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x...
- thu-ml/SpargeAttn: [ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
- fla-org/flame: 🔥 A minimal training framework for scaling FLA models
- foundation-model-stack/fms-fsdp: 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for...