ServerlessLLM/ServerlessLLM
Serverless LLM Serving for Everyone.
This project helps machine learning engineers and MLOps specialists efficiently deploy and manage multiple large language models (LLMs) and embedding models on shared GPU resources. It takes various LLM model checkpoints as input and outputs a fast, cost-effective serving cluster with an OpenAI-compatible API, allowing for quick deployment, querying, and even fine-tuning of models. This is for users who need to serve many AI models without incurring high hardware costs.
Use this if you need to serve multiple large language models or embedding models on a single GPU or a small cluster, significantly reducing infrastructure costs and improving model loading speed.
Not ideal if you only need to run a single LLM on dedicated hardware or are not concerned with optimizing GPU utilization and model loading times.
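The serving cluster described above exposes an OpenAI-compatible API. A minimal sketch of building a query for such an endpoint follows; the host, port, and model name are illustrative assumptions, not values taken from the project's documentation:

```python
import json
import urllib.request

# Hypothetical endpoint: ServerlessLLM exposes an OpenAI-compatible API,
# but the actual host and port depend on your deployment.
ENDPOINT = "http://localhost:8000/v1/chat/completions"

# Standard OpenAI chat-completions payload; the model name is an example.
payload = {
    "model": "facebook/opt-1.3b",
    "messages": [{"role": "user", "content": "Hello!"}],
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Actually sending the request requires a running server, so it is left
# commented out here:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the API follows the OpenAI wire format, official OpenAI client libraries pointed at the cluster's base URL should also work.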
Stars: 663
Forks: 68
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 06, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ServerlessLLM/ServerlessLLM"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related models
PaddlePaddle/FastDeploy
High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
mlc-ai/mlc-llm
Universal LLM Deployment Engine with ML Compilation
skyzh/tiny-llm
A course of learning LLM inference serving on Apple Silicon for systems engineers: build a tiny...
AXERA-TECH/ax-llm
Explore LLM model deployment based on AXera's AI chips
AmpereComputingAI/ampere_model_library
AML's goal is to make benchmarking of various AI architectures on Ampere CPUs a pleasurable experience :)