EfficientMoE/MoE-Infinity
PyTorch library for cost-effective, fast, and easy serving of MoE models.
This tool helps machine learning engineers and researchers serve large Mixture-of-Experts (MoE) models, such as those used for chatbots and language translation, more efficiently. It accepts HuggingFace-compatible MoE models and generates text with significantly lower latency and GPU memory requirements, even on less powerful GPUs. The ideal user is someone managing the deployment of large language models.
Use this if you need to run large Mixture-of-Experts models on GPUs with limited memory and want faster inference than other serving solutions provide.
Not ideal if you require distributed inference across multiple machines, as this open-source version currently focuses on single or multi-GPU inference on a single node.
Stars
288
Forks
25
Language
Python
License
Apache-2.0
Category
Last pushed
Mar 03, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/EfficientMoE/MoE-Infinity"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
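The same endpoint can be called from Python instead of curl. A minimal sketch, assuming only the URL pattern shown in the curl example above; the `quality_url` and `fetch_quality` helper names are hypothetical, and the response fields are not documented here, so the result is returned as a plain dict:

```python
import json
import urllib.request

# Base path taken from the curl example above; assumed stable.
BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    """Build the quality-API URL for a given GitHub owner/repo pair."""
    return f"{BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch repository quality data as a dict (requires network access)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Prints the URL for this repository; call fetch_quality(...) to
    # retrieve live data within the unauthenticated 100 requests/day limit.
    print(quality_url("EfficientMoE", "MoE-Infinity"))
```

Keeping URL construction separate from the network call makes the helper easy to reuse for other repositories listed on this page.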
Related models
raymin0223/mixture_of_recursions
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation...
AviSoori1x/makeMoE
From scratch implementation of a sparse mixture of experts language model inspired by Andrej...
thu-nics/MoA
[CoLM'25] The official implementation of the paper
jaisidhsingh/pytorch-mixtures
One-stop solutions for Mixture of Expert modules in PyTorch.
CASE-Lab-UMD/Unified-MoE-Compression
The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study...