erans/selfhostllm
A web-based calculator for estimating GPU memory requirements and maximum concurrent requests for self-hosted LLM inference.
This tool helps you estimate how many simultaneous requests your GPU setup can handle when running large language models (LLMs) on your own hardware. You enter your GPU's memory, the model you want to run, and any quantization settings, and it estimates the maximum number of concurrent users or tasks your system can support. It is aimed at IT professionals, ML engineers, and researchers who deploy LLMs locally and need to plan hardware resources.
Use this if you need to estimate the GPU memory required and the maximum concurrent users for self-hosting a large language model, ensuring efficient resource allocation.
Not ideal if you are using cloud-based LLM APIs or do not manage your own GPU infrastructure for inference.
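The kind of estimate the calculator performs can be sketched with back-of-the-envelope math: model weights take roughly parameters × bits-per-weight ÷ 8 bytes, each in-flight request needs a KV cache proportional to layers × context length × hidden size, and whatever GPU memory remains divides into concurrent slots. The function below is a minimal illustration of that arithmetic under assumed defaults (fp16 KV cache, a fixed overhead reserve); the names and constants are illustrative, not the tool's exact model.

```python
def max_concurrent_requests(gpu_mem_gb, params_b, quant_bits,
                            num_layers, hidden_size, context_len,
                            kv_bytes=2, overhead_gb=2.0):
    """Rough estimate of concurrent requests from GPU memory headroom.

    gpu_mem_gb   -- total GPU memory in GB
    params_b     -- model size in billions of parameters
    quant_bits   -- bits per weight (16 = fp16, 4 = 4-bit quantization)
    kv_bytes     -- bytes per KV-cache element (2 assumes fp16)
    overhead_gb  -- reserve for activations, CUDA context, fragmentation
    """
    # Weight memory: params (billions) * bits / 8 gives GB directly.
    weights_gb = params_b * quant_bits / 8
    # KV cache per request: keys + values (factor 2) across all layers,
    # one vector of hidden_size per token up to the full context length.
    kv_per_req_gb = (2 * num_layers * context_len * hidden_size
                     * kv_bytes) / 1e9
    free_gb = gpu_mem_gb - weights_gb - overhead_gb
    return max(0, int(free_gb // kv_per_req_gb))

# Example: a 7B model at 4-bit on a 24 GB GPU, Llama-2-7B-like shape
# (32 layers, hidden size 4096) with a 4096-token context.
print(max_concurrent_requests(24, 7, 4, 32, 4096, 4096))  # → 8
```

Real serving stacks (continuous batching, paged KV cache) change these numbers considerably, which is why an interactive calculator is more practical than a fixed formula.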
Stars: 37
Forks: 4
Language: HTML
License: MIT
Category:
Last pushed: Feb 25, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/erans/selfhostllm"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
vllm-project/vllm-ascend
Community maintained hardware plugin for vLLM on Ascend
kvcache-ai/Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
SemiAnalysisAI/InferenceX
Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS - GB200 NVL72 vs MI355X...
sophgo/tpu-mlir
Machine learning compiler based on MLIR for Sophgo TPU.
uccl-project/uccl
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache...