vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
This project helps machine learning engineers, MLOps specialists, and developers efficiently deploy and serve large language models (LLMs) in production. You provide your chosen model and receive a high-throughput, memory-optimized inference service ready for use, letting you integrate LLM capabilities into applications without sacrificing speed or cost efficiency.
73,007 stars. Used by 46 other packages. Actively maintained with 912 commits in the last 30 days. Available on PyPI.
Use this if you need to run large language models at scale, serve many user requests simultaneously, and minimize the computational resources required.
Not ideal if you are a casual user looking for a pre-built chatbot or a simple API wrapper without needing to manage the underlying model serving infrastructure.
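In practice, "serving at scale" means launching vLLM's OpenAI-compatible HTTP server (e.g. `vllm serve <model>`, which listens on port 8000 by default) and sending it standard chat-completions requests. A minimal sketch of such a request payload, using only the standard library; the model name and endpoint below are illustrative assumptions, not values from this page:

```python
import json

# Endpoint exposed by an OpenAI-compatible vLLM server started with
# `vllm serve <model>`; host, port, and model name are illustrative.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",  # example model, swap for yours
    "messages": [
        {"role": "user", "content": "Summarize vLLM in one sentence."}
    ],
    "max_tokens": 64,
    "temperature": 0.2,
}

# Serialize the body; send it with urllib.request, requests, or curl
# once the server is up.
body = json.dumps(payload)
print(body)
```

Because the API is OpenAI-compatible, existing OpenAI client code can usually be pointed at the vLLM server by changing only the base URL.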
Stars: 73,007
Forks: 14,312
Language: Python
License: Apache-2.0
Last pushed: Mar 13, 2026
Commits (30d): 912
Dependencies: 68
Reverse dependents: 46
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/vllm-project/vllm"
Open to everyone: 100 requests/day with no key needed; get a free key for 1,000/day.
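The API response can be consumed with a few lines of standard-library Python. The field names in the sample below are assumptions for illustration only; the actual schema is not documented on this page, so inspect a real response before relying on any key:

```python
import json

# Hypothetical payload shaped like the stats shown above; the key
# names ("stars", "commits_30d", ...) are assumed, not documented.
sample = (
    '{"stars": 73007, "forks": 14312, "commits_30d": 912,'
    ' "license": "Apache-2.0"}'
)

# To fetch live data instead, you could use:
#   urllib.request.urlopen("https://pt-edge.onrender.com/api/v1/quality/transformers/vllm-project/vllm")
data = json.loads(sample)
print(f"{data['stars']} stars, {data['commits_30d']} commits in the last 30 days")
```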
Related projects
sgl-project/sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN
MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...
tenstorrent/tt-metal
TT-NN operator library, and TT-Metalium low-level kernel programming model.