alibaba/InferSim
A Lightweight LLM Inference Performance Simulator
This tool helps AI infrastructure engineers and model developers predict the performance of large language models (LLMs) on different GPU setups. You input your LLM's architecture details and GPU specifications, and it outputs key performance metrics such as time-to-first-token (TTFT), time-per-output-token (TPOT), and overall throughput (tokens/GPU/second). It is aimed at anyone deploying and optimizing LLM inference systems.
Use this if you need to understand how changes in LLM model design or GPU hardware will impact inference speed and efficiency, especially for multi-GPU or multi-node deployments.
Not ideal if you are a data scientist primarily focused on model training or fine-tuning, and not directly involved in the system-level deployment and performance optimization of LLM inference.
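To make the inputs and outputs concrete, here is a back-of-the-envelope sketch of the kind of estimate such a simulator automates. This is not InferSim's actual API; it is a generic roofline model under two common assumptions: prefill is compute-bound (roughly 2 FLOPs per parameter per token), and each decode step is memory-bandwidth-bound (all weights are read once per token). The function name and the example hardware numbers are illustrative.

```python
# Hypothetical roofline estimate of TTFT and TPOT (not InferSim's API).
# Assumes: prefill limited by compute, decode limited by weight reads
# from GPU memory; ignores KV-cache traffic, batching, and overlap.

def estimate_ttft_tpot(n_params, prompt_len, gpu_tflops, gpu_bw_gbs,
                       bytes_per_param=2):
    """Return (ttft_seconds, tpot_seconds) under a simple roofline model.

    n_params:        model parameter count
    prompt_len:      prompt length in tokens
    gpu_tflops:      peak compute in TFLOPS
    gpu_bw_gbs:      memory bandwidth in GB/s
    bytes_per_param: 2 for FP16/BF16 weights
    """
    # Prefill: ~2 * n_params FLOPs per prompt token, compute-bound.
    prefill_flops = 2 * n_params * prompt_len
    ttft = prefill_flops / (gpu_tflops * 1e12)

    # Decode: every output token streams all weights once, bandwidth-bound.
    weight_bytes = n_params * bytes_per_param
    tpot = weight_bytes / (gpu_bw_gbs * 1e9)
    return ttft, tpot

# Illustrative numbers: a 7B FP16 model, 1024-token prompt, on a GPU
# with ~300 TFLOPS peak compute and ~2 TB/s HBM bandwidth.
ttft, tpot = estimate_ttft_tpot(7e9, prompt_len=1024,
                                gpu_tflops=300, gpu_bw_gbs=2000)
print(f"TTFT ~= {ttft * 1e3:.1f} ms, TPOT ~= {tpot * 1e3:.1f} ms")
```

Real simulators refine this with attention and KV-cache costs, batching, quantization, and multi-GPU communication, which is exactly the detail a tool like this exists to capture.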
Stars: 65
Forks: 18
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 02, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/alibaba/InferSim"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related models
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
sgl-project/sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN
MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...