NimbleEdge/sparse_transformers
Sparse Inferencing for transformer based LLMs
This project helps developers and MLOps engineers make large language models (LLMs) run faster and use less memory when generating text. It applies sparsity techniques to a standard LLM, preserving the model's behavior while significantly improving time-to-first-token and per-token generation speed and reducing the memory footprint. It's for those deploying or serving LLMs who need to optimize performance and resource usage.
216 stars. No commits in the last 6 months.
Use this if you are deploying transformer-based large language models and need to reduce memory consumption and significantly increase the speed of text generation on CPU, with GPU optimization planned.
Not ideal if you are working with non-transformer models, need GPU acceleration for sparse inference today (GPU support is still in progress), or are not comfortable working with C++ extensions in Python.
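The core idea behind this style of sparse inference is that, in a ReLU-style feed-forward block, most neurons produce zero activations for any given token, so only the "active" rows and columns of the weight matrices need to be computed. A minimal NumPy sketch of that idea (this is an illustration of the general technique, not this repository's implementation; real systems use a cheap learned predictor instead of the exact pre-activations used here):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 8, 32
x = rng.standard_normal(d_model)
W_up = rng.standard_normal((d_ff, d_model))     # up-projection
W_down = rng.standard_normal((d_model, d_ff))   # down-projection

def dense_ffn(x):
    # Standard ReLU feed-forward block: compute all d_ff neurons.
    h = np.maximum(W_up @ x, 0.0)
    return W_down @ h

def sparse_ffn(x):
    # Select the neurons that will fire (here: exact pre-activation signs,
    # standing in for a cheap learned predictor), then compute only those
    # rows of W_up and columns of W_down.
    active = (W_up @ x) > 0.0
    h_active = W_up[active] @ x
    return W_down[:, active] @ h_active

# Because ReLU zeros out the inactive neurons, both paths agree exactly.
assert np.allclose(dense_ffn(x), sparse_ffn(x))
```

The speedup comes from skipping the inactive rows/columns entirely; with high sparsity, most of the matrix-multiply work (and weight traffic) disappears.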
Stars
216
Forks
12
Language
Python
License
Apache-2.0
Category
Last pushed
Aug 11, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NimbleEdge/sparse_transformers"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
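The same endpoint can be called from Python with the standard library; the URL structure follows the curl example above, while the response schema and the `fetch_quality` helper name are assumptions for illustration:

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def fetch_quality(ecosystem: str, owner: str, repo: str) -> dict:
    """Fetch the quality record for a repo as parsed JSON.

    The exact response fields are an assumption; inspect the returned
    dict to see what the API actually provides.
    """
    url = f"{API_BASE}/{ecosystem}/{owner}/{repo}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

# Usage (within the 100 requests/day keyless limit):
# data = fetch_quality("transformers", "NimbleEdge", "sparse_transformers")
```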
Higher-rated alternatives
fla-org/flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
thu-ml/SageAttention
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x...
thu-ml/SpargeAttn
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
fla-org/flame
🔥 A minimal training framework for scaling FLA models
foundation-model-stack/fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for...