NimbleEdge/sparse_transformers
Sparse Inferencing for transformer based LLMs
This project helps developers and MLOps engineers make large language models (LLMs) run faster and use less memory when generating text. It applies sparsity techniques to a standard LLM, preserving the model's behavior while significantly improving time-to-first-token and per-token generation speed and reducing the memory footprint. It's for those deploying or serving LLMs who need to optimize performance and resource usage.
216 stars. No commits in the last 6 months.
Use this if you are deploying transformer-based large language models and need to reduce memory consumption and significantly increase the speed of text generation on CPU, with GPU optimization planned.
Not ideal if you are working with non-transformer models, need GPU acceleration for sparse inference today (GPU support is still in progress), or are not comfortable working with C++ extensions in Python.
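The core idea behind this style of sparse inference is that, in a ReLU-style feed-forward block, most neurons produce zero activations for any given token, so only the "active" rows and columns of the weight matrices need to be computed. A minimal NumPy sketch of that idea (this is an illustration of the general technique, not this repository's implementation; real systems use a cheap learned predictor instead of the exact pre-activations used here):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 8, 32
x = rng.standard_normal(d_model)
W_up = rng.standard_normal((d_ff, d_model))     # up-projection
W_down = rng.standard_normal((d_model, d_ff))   # down-projection

def dense_ffn(x):
    # Standard ReLU feed-forward block: compute all d_ff neurons.
    h = np.maximum(W_up @ x, 0.0)
    return W_down @ h

def sparse_ffn(x):
    # Select the neurons that will fire (here: exact pre-activation signs,
    # standing in for a cheap learned predictor), then compute only those
    # rows of W_up and columns of W_down.
    active = (W_up @ x) > 0.0
    h_active = W_up[active] @ x
    return W_down[:, active] @ h_active

# Because ReLU zeros out the inactive neurons, both paths agree exactly.
assert np.allclose(dense_ffn(x), sparse_ffn(x))
```

The speedup comes from skipping the inactive rows/columns entirely; with high sparsity, most of the matrix-multiply work (and weight traffic) disappears.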
Stars
216
Forks
12
Language
Python
License
Apache-2.0
Category
Last pushed
Aug 11, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NimbleEdge/sparse_transformers"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
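The same endpoint can be called from Python with the standard library; the URL structure follows the curl example above, while the response schema and the `fetch_quality` helper name are assumptions for illustration:

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def fetch_quality(ecosystem: str, owner: str, repo: str) -> dict:
    """Fetch the quality record for a repo as parsed JSON.

    The exact response fields are an assumption; inspect the returned
    dict to see what the API actually provides.
    """
    url = f"{API_BASE}/{ecosystem}/{owner}/{repo}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

# Usage (within the 100 requests/day keyless limit):
# data = fetch_quality("transformers", "NimbleEdge", "sparse_transformers")
```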
Higher-rated alternatives
fla-org/flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
thu-ml/SageAttention
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x...
thu-ml/SpargeAttn
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
fla-org/flame
🔥 A minimal training framework for scaling FLA models
foundation-model-stack/fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for...