ServerlessLLM/ServerlessLLM

Serverless LLM Serving for Everyone.

Score: 54 / 100 (Established)

ServerlessLLM helps machine learning engineers and MLOps specialists deploy and manage multiple large language models (LLMs) and embedding models on shared GPU resources. Given a set of model checkpoints, it provides a fast, cost-effective serving cluster with an OpenAI-compatible API for deploying, querying, and even fine-tuning models. It is aimed at users who need to serve many AI models without incurring high hardware costs.


Use this if you need to serve multiple large language models or embedding models on a single GPU or a small cluster, significantly reducing infrastructure costs and improving model loading speed.

Not ideal if you only need to run a single LLM on dedicated hardware or are not concerned with optimizing GPU utilization and model loading times.

Tags: LLM deployment, MLOps, AI model serving, GPU resource management, inference optimization
No package published · No dependents
Maintenance 10 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 18 / 25

How are scores calculated?
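The overall score appears to be the sum of the four 25-point subscores listed above (10 + 10 + 16 + 18 = 54 out of 4 × 25 = 100). This is an inference from the numbers on this page, not a documented formula; a minimal sketch of that arithmetic:

```python
# Assumption: the 54/100 overall score is the sum of the four subscores
# shown on this page, each graded out of 25.
subscores = {
    "Maintenance": 10,
    "Adoption": 10,
    "Maturity": 16,
    "Community": 18,
}

total = sum(subscores.values())       # 10 + 10 + 16 + 18 = 54
max_total = 25 * len(subscores)       # four axes of 25 points = 100

print(f"{total} / {max_total}")       # → 54 / 100
```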

Stars: 663
Forks: 68
Language: Python
License: Apache-2.0
Last pushed: Mar 06, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ServerlessLLM/ServerlessLLM"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
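The same request can be made from Python. The endpoint path below is taken verbatim from the curl example above; the response's JSON field names are not documented here, so the fetch helper simply returns the decoded dict:

```python
# Sketch of calling the quality API from Python, based on the curl
# example above. No API key is needed up to 100 requests/day.
import json
from urllib.request import urlopen

BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    """Build the API URL for a given GitHub owner/repo pair."""
    return f"{BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch and decode the quality JSON for owner/repo."""
    with urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

# URL for the repo on this page:
url = quality_url("ServerlessLLM", "ServerlessLLM")
```

Pass a free API key (per the note above) if you need more than 100 requests/day; how the key is supplied (header vs. query parameter) is not specified on this page.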