ByteDance-Seed/FlexPrefill
Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
This project helps AI developers and researchers make large language models (LLMs) run faster and more efficiently when processing very long texts. It takes an existing LLM, such as LLaMA, and applies a context-aware sparse attention mechanism that adapts how each attention head handles long inputs. You keep the same text-generation capabilities of your LLM, but with reduced computational cost and faster inference.
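The core idea behind this kind of context-aware sparsity can be sketched in a few lines. The toy NumPy function below is an illustration only, not the project's actual implementation (which uses optimized GPU kernels and per-head pattern selection): for one query, it scores coarse key blocks, then attends only to enough blocks to cover a cumulative attention-mass threshold `gamma`, so easy contexts get very sparse attention and hard ones get denser attention. All names here are made up for the sketch.

```python
import numpy as np

def sparse_prefill_attention(q, k, v, gamma=0.95, block=4):
    """Toy context-aware sparse attention for a single query vector.

    q: (d,) query; k, v: (n, d) keys/values; gamma: fraction of the
    estimated attention mass that the selected key blocks must cover.
    Returns (output vector, number of keys actually attended to).
    """
    n, d = k.shape
    nb = n // block
    # Cheap estimate: score each key block by its mean key vector.
    k_blocks = k[: nb * block].reshape(nb, block, d).mean(axis=1)
    s = k_blocks @ q / np.sqrt(d)
    p = np.exp(s - s.max())
    p /= p.sum()
    # Keep the highest-scoring blocks until they cover mass >= gamma.
    order = np.argsort(p)[::-1]
    cum = np.cumsum(p[order])
    keep = order[: int(np.searchsorted(cum, gamma)) + 1]
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep])
    # Exact attention, but only over the selected keys.
    scores = k[idx] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v[idx], len(idx)
```

With a small `gamma` most key blocks are skipped, which is where the savings come from on long sequences; with `gamma=1.0` every block is kept and the result matches dense attention over all keys.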
164 stars. No commits in the last 6 months.
Use this if you are a developer working with large language models and need to improve their inference speed and resource efficiency, especially when handling lengthy documents or conversations.
Not ideal if you are looking for a pre-trained model for direct use or if your main concern isn't the computational efficiency of long-sequence inference.
Stars
164
Forks
9
Language
Python
License
Apache-2.0
Category
Last pushed
Oct 13, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ByteDance-Seed/FlexPrefill"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
EfficientMoE/MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
raymin0223/mixture_of_recursions
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation...
AviSoori1x/makeMoE
From scratch implementation of a sparse mixture of experts language model inspired by Andrej...
thu-nics/MoA
[CoLM'25] The official implementation of the paper
jaisidhsingh/pytorch-mixtures
One-stop solutions for Mixture of Expert modules in PyTorch.