slwang-ustc/nano-vllm-v1
Nano vLLM with vLLM v1's request scheduling strategy and chunked prefill
This is a lightweight inference engine for developers who need to run large language models (LLMs) efficiently. Given an LLM and a set of input prompts, it generates text responses with high throughput and low latency. It is aimed at software engineers building LLM-powered applications or services, especially those managing inference at scale.
Use this if you are a software engineer deploying large language models and need a highly performant, yet readable, inference server solution with advanced scheduling features.
Not ideal if you are an end-user looking for a no-code solution to interact with LLMs, or if you don't have experience with Python development and model deployment.
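The "chunked prefill" mentioned in the tagline refers to splitting a long prompt's prefill into fixed-size chunks so that decode steps of other requests can be interleaved between them. A minimal sketch of the idea, assuming a simple token budget per step (the function name and chunk size are illustrative, not nano-vllm-v1's actual API):

```python
def chunk_prefill(prompt_tokens, chunk_size):
    """Split a prompt's token list into fixed-size prefill chunks.

    Rather than prefilling the whole prompt in one scheduler step,
    each chunk is processed in a separate step, leaving room in the
    per-step token budget for other requests' decode tokens.
    """
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

# A 10-token prompt with a chunk size of 4 yields two full chunks
# and one partial final chunk.
chunks = chunk_prefill(list(range(10)), chunk_size=4)
```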
Stars: 61
Forks: 22
Language: Python
License: MIT
Category:
Last pushed: Jan 26, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/slwang-ustc/nano-vllm-v1"
Open to everyone: 100 requests/day with no key required. Get a free key for 1,000 requests/day.
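The same endpoint can be called from Python. A minimal sketch, assuming only the URL pattern shown in the curl example above (the response schema is not documented here):

```python
import json
from urllib.request import urlopen

# Base URL taken from the curl example; path segments appear to be
# ecosystem / owner / repo (an assumption based on that one example).
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem, owner, repo):
    """Build the quality-API URL for a repository."""
    return f"{API_BASE}/{ecosystem}/{owner}/{repo}"

url = quality_url("transformers", "slwang-ustc", "nano-vllm-v1")

# Uncomment to fetch live data (counts against the 100 requests/day limit):
# with urlopen(url) as resp:
#     data = json.load(resp)
```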
Higher-rated alternatives
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
sgl-project/sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN
MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...