efeslab/Nanoflow
A throughput-oriented high-performance serving framework for LLMs
Nanoflow is designed for engineers and MLOps professionals who need to run large language models (LLMs) efficiently in production. It serves common models (such as Llama 2, Llama 3, or Qwen2) faster and more efficiently than comparable systems, yielding a highly responsive LLM service that can handle more user requests on the same hardware.
Use this if you need to serve large language models to many users and want to maximize the number of requests your existing GPU hardware can handle.
Not ideal if you are a data scientist performing one-off model experiments or if you are serving models other than large language models.
Stars: 949
Forks: 47
Language: Jupyter Notebook
License: —
Category:
Last pushed: Oct 29, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/efeslab/Nanoflow"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
sgl-project/sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN
MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...