erans/selfhostllm
A web-based calculator for estimating GPU memory requirements and maximum concurrent requests for self-hosted LLM inference.
This tool helps you estimate how many simultaneous requests your GPU setup can handle when running large language models (LLMs) on your own hardware. You enter your GPU's memory, the model you want to run, and any quantization settings, and it estimates the maximum number of concurrent users or tasks your system can support. It is aimed at IT professionals, ML engineers, and researchers who deploy LLMs locally and need to plan hardware resources.
Use this if you need to estimate the GPU memory required and the maximum concurrent users for self-hosting a large language model, ensuring efficient resource allocation.
Not ideal if you are using cloud-based LLM APIs or do not manage your own GPU infrastructure for inference.
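The kind of estimate the calculator performs can be sketched with back-of-the-envelope math: model weights take roughly parameters × bits-per-weight ÷ 8 bytes, each in-flight request needs a KV cache proportional to layers × context length × hidden size, and whatever GPU memory remains divides into concurrent slots. The function below is a minimal illustration of that arithmetic under assumed defaults (fp16 KV cache, a fixed overhead reserve); the names and constants are illustrative, not the tool's exact model.

```python
def max_concurrent_requests(gpu_mem_gb, params_b, quant_bits,
                            num_layers, hidden_size, context_len,
                            kv_bytes=2, overhead_gb=2.0):
    """Rough estimate of concurrent requests from GPU memory headroom.

    gpu_mem_gb   -- total GPU memory in GB
    params_b     -- model size in billions of parameters
    quant_bits   -- bits per weight (16 = fp16, 4 = 4-bit quantization)
    kv_bytes     -- bytes per KV-cache element (2 assumes fp16)
    overhead_gb  -- reserve for activations, CUDA context, fragmentation
    """
    # Weight memory: params (billions) * bits / 8 gives GB directly.
    weights_gb = params_b * quant_bits / 8
    # KV cache per request: keys + values (factor 2) across all layers,
    # one vector of hidden_size per token up to the full context length.
    kv_per_req_gb = (2 * num_layers * context_len * hidden_size
                     * kv_bytes) / 1e9
    free_gb = gpu_mem_gb - weights_gb - overhead_gb
    return max(0, int(free_gb // kv_per_req_gb))

# Example: a 7B model at 4-bit on a 24 GB GPU, Llama-2-7B-like shape
# (32 layers, hidden size 4096) with a 4096-token context.
print(max_concurrent_requests(24, 7, 4, 32, 4096, 4096))  # → 8
```

Real serving stacks (continuous batching, paged KV cache) change these numbers considerably, which is why an interactive calculator is more practical than a fixed formula.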
Stars: 37
Forks: 4
Language: HTML
License: MIT
Category:
Last pushed: Feb 25, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/erans/selfhostllm"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
vllm-project/vllm-ascend
Community maintained hardware plugin for vLLM on Ascend
kvcache-ai/Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
SemiAnalysisAI/InferenceX
Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS - GB200 NVL72 vs MI355X...
sophgo/tpu-mlir
Machine learning compiler based on MLIR for Sophgo TPU.
uccl-project/uccl
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache...