NimbleEdge/sparse_transformers

Sparse inference for transformer-based LLMs

Score: 39/100 (Emerging)

This project helps developers and MLOps engineers make large language models (LLMs) run faster and use less memory during text generation. It applies sparsity techniques to a standard LLM, producing a model that behaves the same but delivers significantly faster time-to-first-token and per-token generation, with a reduced memory footprint. It is for anyone deploying or serving LLMs who needs to optimize performance and resource usage.

216 stars. No commits in the last 6 months.

Use this if you are deploying transformer-based large language models and need to reduce memory consumption and significantly increase the speed of text generation on CPU, with GPU optimization planned.

Not ideal if you are working with non-transformer models, need GPU performance benefits today (GPU support for sparse inference is still in progress), or are not comfortable working with C++ extensions in Python.

Tags: LLM deployment, model optimization, inference acceleration, deep learning operations, resource management
Badges: Stale (6m), No Package, No Dependents
Maintenance: 2/25
Adoption: 10/25
Maturity: 16/25
Community: 11/25


Stars: 216
Forks: 12
Language: Python
License: Apache-2.0
Last pushed: Aug 11, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NimbleEdge/sparse_transformers"

Open to everyone: 100 requests/day with no key required. Get a free key for 1,000 requests/day.
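
For programmatic use, here is a minimal Python sketch that calls the same endpoint with the requests library. It assumes the API returns a JSON payload; the response fields are not documented here, so the example simply prints the whole body.

import requests

# Quality-score endpoint for this repository (from the curl example above).
URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "transformers/NimbleEdge/sparse_transformers")

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # fail loudly on 4xx/5xx, e.g. if the daily limit is hit
print(resp.json())       # assumed JSON payload; inspect it to see the available fields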