NVIDIA/Star-Attention

Efficient LLM Inference over Long Sequences

Quality score: 40 / 100 (Emerging)

This project helps large language model (LLM) developers and MLOps engineers speed up response generation, especially over very long input texts. It takes an existing Transformer-based LLM, swaps in an optimized block-sparse attention mechanism at inference time, and yields the same model with substantially faster inference and minimal accuracy loss. It is aimed at professionals building and deploying LLMs who need to serve long-context applications efficiently.
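
Star Attention's core technique, as described in its accompanying paper, is two-phase: a long context is split into blocks encoded in parallel with block-local attention, and query tokens then attend to all cached key/value shards, whose partial results are merged with a distributed softmax. The following is a minimal NumPy sketch of that merge step; the function names, shapes, and sharding here are illustrative assumptions, not the repository's actual API.

import numpy as np

def shard_attention(q, k, v):
    # Softmax attention over a single KV shard, plus the per-query
    # log-sum-exp needed to merge this shard's result with the others.
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (n_q, n_kv)
    m = scores.max(axis=-1, keepdims=True)        # per-query max, for stability
    p = np.exp(scores - m)
    denom = p.sum(axis=-1, keepdims=True)
    out = (p @ v) / denom                         # shard-local attention output
    lse = (m + np.log(denom)).squeeze(-1)         # log-sum-exp per query
    return out, lse

def merge_shards(outs, lses):
    # Weight each shard's output by a softmax over its log-sum-exp values;
    # this reproduces exact global attention across all shards.
    lses = np.stack(lses)                         # (n_shards, n_q)
    w = np.exp(lses - lses.max(axis=0))
    w = w / w.sum(axis=0)
    return sum(w[i][:, None] * o for i, o in enumerate(outs))

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64))                  # 4 query tokens, head dim 64
shards = [(rng.standard_normal((128, 64)), rng.standard_normal((128, 64)))
          for _ in range(3)]                      # 3 KV-cache shards
outs, lses = zip(*(shard_attention(q, k, v) for k, v in shards))
merged = merge_shards(list(outs), list(lses))     # (4, 64), matches full attention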

392 stars. No commits in the last 6 months.

Use this if you are running Transformer-based LLMs that process very long text inputs and you need to significantly improve their response generation speed without extensive retraining.

Not ideal if your LLM applications primarily deal with short text inputs or if you are not working with Transformer architectures.

LLM deployment · model serving · inference optimization · natural language processing · AI infrastructure
Stale (6m) · No Package · No Dependents
Maintenance: 2 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 12 / 25

Stars: 392
Forks: 21
Language: Python
License: Apache-2.0
Last pushed: Jun 25, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NVIDIA/Star-Attention"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
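
If you would rather call the API from Python, here is a minimal sketch using the requests library; the choice of HTTP client is an assumption, and the response schema is whatever the endpoint returns.

import requests

# Same endpoint as the curl example above; no API key is required
# at the free tier (100 requests/day).
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/NVIDIA/Star-Attention"
resp = requests.get(url, timeout=10)
resp.raise_for_status()   # surface HTTP errors early
data = resp.json()        # parsed JSON payload; exact fields depend on the API
print(data)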