efeslab/Nanoflow
A throughput-oriented high-performance serving framework for LLMs
Nanoflow is designed for engineers and MLOps professionals who need to run large language models (LLMs) efficiently in production. It serves common models (such as Llama 2, Llama 3, or Qwen2) faster and more efficiently than comparable systems, yielding a highly responsive LLM service that can handle more user requests on the same hardware.
Use this if you need to serve large language models to many users and want to maximize the number of requests your existing GPU hardware can handle.
Not ideal if you are a data scientist performing one-off model experiments or if you are serving models other than large language models.
Stars: 949
Forks: 47
Language: Jupyter Notebook
License: —
Category:
Last pushed: Oct 29, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/efeslab/Nanoflow"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
sgl-project/sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN
MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...