nanowell/Q-Sparse-LLM

My Implementation of Q-Sparse: All Large Language Models can be Fully Sparsely-Activated

Score: 26 / 100 (Experimental)

This project helps machine learning engineers and researchers optimize large language models (LLMs) for deployment. It takes existing transformer-based LLMs and applies Q-Sparse activation sparsification, which keeps only the top-K largest-magnitude activations at each layer and zeroes the rest, so the model runs more efficiently. The output is a functionally comparable LLM that requires less compute and memory, making it suitable for resource-constrained environments.
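To make the core idea concrete: in Q-Sparse (the paper this repo implements), the input to each linear projection is sparsified by keeping only its top-K entries by magnitude, so most of the matrix multiply can be skipped. Below is a minimal PyTorch sketch of that idea; the names (topk_sparsify, SparseLinear, sparsity) are illustrative and not this repo's actual API.

import torch

def topk_sparsify(x: torch.Tensor, k: int) -> torch.Tensor:
    # Keep the k largest-magnitude entries along the last dim, zero the rest.
    _, idx = torch.topk(x.abs(), k, dim=-1)
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
    return x * mask

class SparseLinear(torch.nn.Linear):
    # Hypothetical helper: a linear layer that sparsifies its input
    # activations before the matmul (the repo may structure this differently).
    def __init__(self, in_features: int, out_features: int,
                 sparsity: float = 0.6, **kwargs):
        super().__init__(in_features, out_features, **kwargs)
        # Number of activation entries kept per token.
        self.k = max(1, int(in_features * (1.0 - sparsity)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return super().forward(topk_sparsify(x, self.k))

# Usage: y = SparseLinear(1024, 4096)(torch.randn(2, 1024))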

No commits in the last 6 months.

Use this if you need to reduce the computational cost and memory footprint of large language models while maintaining their performance, especially for deployment on resource-limited hardware.

Not ideal if you want a pre-trained, ready-to-use LLM to deploy as-is, with no need for model modification or advanced optimization.

large-language-models model-optimization edge-ai efficient-inference machine-learning-deployment
Stale (6 months) · No Package · No Dependents
Maintenance 0 / 25
Adoption 7 / 25
Maturity 16 / 25
Community 3 / 25

Stars: 34
Forks: 1
Language: Python
License: MIT
Last pushed: Aug 14, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/nanowell/Q-Sparse-LLM"

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
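The same data can be fetched from Python. A minimal sketch, assuming the endpoint above returns JSON:

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/nanowell/Q-Sparse-LLM"
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # raise on 4xx/5xx (e.g. once the daily rate limit is hit)
print(resp.json())       # quality scores and repo stats as a JSON object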