ByteDance-Seed/FlexPrefill

Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

Score: 38 / 100 · Emerging

This project helps AI developers and researchers make large language models (LLMs) run faster and more efficiently when processing very long texts. It takes an existing LLM, such as LLaMA, and applies a context-aware sparse attention mechanism to optimize how it handles long inputs. You keep the same text generation capabilities of your LLM, but with reduced computational cost and faster inference.
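To make "sparse attention" concrete, here is a toy block-sparse attention sketch in NumPy: each query attends only to its top-scoring key blocks instead of the full sequence. This is an illustrative simplification, not FlexPrefill's actual algorithm, and all names here are hypothetical.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size=4, keep_blocks=2):
    """Toy block-sparse attention (illustrative only, not FlexPrefill's method).

    Each query row attends only to the `keep_blocks` key blocks with the
    highest mean attention score; all other positions are masked out.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                # full score matrix (toy scale)
    nb = n // block_size                         # number of key blocks
    # Mean score of each (query, key-block) pair: shape (n, nb)
    block_scores = scores.reshape(n, nb, block_size).mean(axis=-1)
    # Indices of the kept key blocks for each query: shape (n, keep_blocks)
    top = np.argsort(block_scores, axis=-1)[:, -keep_blocks:]
    # Build an additive mask: -inf everywhere except the kept blocks
    mask = np.full((n, n), -np.inf)
    for i in range(n):
        for b in top[i]:
            mask[i, b * block_size:(b + 1) * block_size] = 0.0
    # Softmax over the unmasked positions only
    w = np.exp(scores + mask)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

The savings come from computing only the kept blocks in a real kernel; this dense toy version just shows the masking logic.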

164 stars. No commits in the last 6 months.

Use this if you are a developer working with large language models and need to improve their inference speed and resource efficiency, especially when handling lengthy documents or conversations.

Not ideal if you are looking for a pre-trained model for direct use or if your main concern isn't the computational efficiency of long-sequence inference.

AI-development LLM-optimization model-inference natural-language-processing computational-efficiency
Stale (6m) · No Package · No Dependents
Maintenance 2 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 10 / 25

How are scores calculated?

Stars: 164
Forks: 9
Language: Python
License: Apache-2.0
Last pushed: Oct 13, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ByteDance-Seed/FlexPrefill"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
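The same request can be made from Python. A minimal sketch using only the standard library, assuming the endpoint returns JSON; the `Authorization: Bearer` header for keyed access is an assumption, not documented behavior:

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem, repo):
    # Build the quality-endpoint URL, e.g. for transformers/ByteDance-Seed/FlexPrefill
    return f"{BASE}/{ecosystem}/{repo}"

def fetch_quality(ecosystem, repo, api_key=None):
    # Assumed: JSON response body; the header name for keyed access is hypothetical
    req = urllib.request.Request(quality_url(ecosystem, repo))
    if api_key:
        req.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

For example, `fetch_quality("transformers", "ByteDance-Seed/FlexPrefill")` mirrors the curl command above.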