ServerlessLLM/ServerlessLLM
Serverless LLM Serving for Everyone.
This project helps machine learning engineers and MLOps specialists efficiently deploy and manage multiple large language models (LLMs) and embedding models on shared GPU resources. It takes various LLM model checkpoints as input and outputs a fast, cost-effective serving cluster with an OpenAI-compatible API, allowing for quick deployment, querying, and even fine-tuning of models. This is for users who need to serve many AI models without incurring high hardware costs.
Use this if you need to serve multiple large language models or embedding models on a single GPU or a small cluster, significantly reducing infrastructure costs and improving model loading speed.
Not ideal if you only need to run a single LLM on dedicated hardware or are not concerned with optimizing GPU utilization and model loading times.
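The serving cluster described above exposes an OpenAI-compatible API. A minimal sketch of building a query for such an endpoint follows; the host, port, and model name are illustrative assumptions, not values taken from the project's documentation:

```python
import json
import urllib.request

# Hypothetical endpoint: ServerlessLLM exposes an OpenAI-compatible API,
# but the actual host and port depend on your deployment.
ENDPOINT = "http://localhost:8000/v1/chat/completions"

# Standard OpenAI chat-completions payload; the model name is an example.
payload = {
    "model": "facebook/opt-1.3b",
    "messages": [{"role": "user", "content": "Hello!"}],
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Actually sending the request requires a running server, so it is left
# commented out here:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the API follows the OpenAI wire format, official OpenAI client libraries pointed at the cluster's base URL should also work.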
Stars: 663
Forks: 68
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 06, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ServerlessLLM/ServerlessLLM"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related models
PaddlePaddle/FastDeploy
High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
mlc-ai/mlc-llm
Universal LLM Deployment Engine with ML Compilation
skyzh/tiny-llm
A course of learning LLM inference serving on Apple Silicon for systems engineers: build a tiny...
AXERA-TECH/ax-llm
Explore LLM model deployment based on AXera's AI chips
AmpereComputingAI/ampere_model_library
AML's goal is to make benchmarking of various AI architectures on Ampere CPUs a pleasurable experience :)