ByteDance-Seed/FlexPrefill
Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
This project helps AI developers and researchers make large language models (LLMs) run faster and more efficiently when processing very long texts. It takes an existing LLM, such as LLaMA, and applies a context-aware sparse attention mechanism that adapts how each attention head handles long inputs. You keep the same text-generation capabilities of your LLM, but with reduced computational cost and faster inference.
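The core idea behind this kind of context-aware sparsity can be sketched in a few lines. The toy NumPy function below is an illustration only, not the project's actual implementation (which uses optimized GPU kernels and per-head pattern selection): for one query, it scores coarse key blocks, then attends only to enough blocks to cover a cumulative attention-mass threshold `gamma`, so easy contexts get very sparse attention and hard ones get denser attention. All names here are made up for the sketch.

```python
import numpy as np

def sparse_prefill_attention(q, k, v, gamma=0.95, block=4):
    """Toy context-aware sparse attention for a single query vector.

    q: (d,) query; k, v: (n, d) keys/values; gamma: fraction of the
    estimated attention mass that the selected key blocks must cover.
    Returns (output vector, number of keys actually attended to).
    """
    n, d = k.shape
    nb = n // block
    # Cheap estimate: score each key block by its mean key vector.
    k_blocks = k[: nb * block].reshape(nb, block, d).mean(axis=1)
    s = k_blocks @ q / np.sqrt(d)
    p = np.exp(s - s.max())
    p /= p.sum()
    # Keep the highest-scoring blocks until they cover mass >= gamma.
    order = np.argsort(p)[::-1]
    cum = np.cumsum(p[order])
    keep = order[: int(np.searchsorted(cum, gamma)) + 1]
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep])
    # Exact attention, but only over the selected keys.
    scores = k[idx] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v[idx], len(idx)
```

With a small `gamma` most key blocks are skipped, which is where the savings come from on long sequences; with `gamma=1.0` every block is kept and the result matches dense attention over all keys.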
164 stars. No commits in the last 6 months.
Use this if you are a developer working with large language models and need to improve their inference speed and resource efficiency, especially when handling lengthy documents or conversations.
Not ideal if you are looking for a pre-trained model for direct use or if your main concern isn't the computational efficiency of long-sequence inference.
Stars
164
Forks
9
Language
Python
License
Apache-2.0
Category
Last pushed
Oct 13, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ByteDance-Seed/FlexPrefill"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
EfficientMoE/MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
raymin0223/mixture_of_recursions
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation...
AviSoori1x/makeMoE
From scratch implementation of a sparse mixture of experts language model inspired by Andrej...
thu-nics/MoA
[CoLM'25] The official implementation of the paper
jaisidhsingh/pytorch-mixtures
One-stop solutions for Mixture of Expert modules in PyTorch.