LLM Inference Engines: Transformer Models
Optimized inference engines and serving systems for deploying and running large language models efficiently. Focuses on throughput, latency, memory optimization, and production deployment. Does NOT include training frameworks, fine-tuning methods, quantization techniques, or model architecture implementations.
We track 164 LLM inference engine projects. Seven score above 70 (the verified tier). The highest-rated is vllm-project/vllm at 87/100 with 73,007 stars. All of the top 10 are actively maintained.
Get all 164 projects as JSON (the example below requests the first 20; raise `limit` to fetch the full set):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-inference-engines&limit=20"
Open to everyone: anonymous access allows 100 requests/day with no key. A free key raises the limit to 1,000 requests/day.
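The same endpoint can be queried from code. The sketch below builds the query URL and filters the returned projects by tier; note that the `name` and `tier` keys in the response are assumptions about the JSON schema (the API's field names are not documented here), so inspect a real payload before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the dataset query URL with URL-encoded parameters."""
    params = {"domain": domain, "subcategory": subcategory, "limit": limit}
    return f"{BASE}?{urlencode(params)}"

def filter_by_tier(projects: list[dict], tier: str) -> list[str]:
    """Return project names in the given tier.

    NOTE: 'name' and 'tier' are assumed response field names,
    not documented guarantees of this API.
    """
    return [p["name"] for p in projects if p.get("tier") == tier]

url = build_url("transformers", "llm-inference-engines", limit=164)

# Live call (counts against the 100 requests/day anonymous quota):
#   with urlopen(url) as resp:
#       projects = json.load(resp)

# Offline demo on a hand-made sample shaped like the assumed schema:
sample = [
    {"name": "vllm-project/vllm", "tier": "Verified", "score": 87},
    {"name": "InternLM/lmdeploy", "tier": "Established"},
]
print(filter_by_tier(sample, "Verified"))  # prints ['vllm-project/vllm']
```

The helpers are deliberately split so the URL construction can be reused for other domain/subcategory pairs without touching the filtering logic.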
| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | vllm-project/vllm | A high-throughput and memory-efficient inference and serving engine for LLMs | | Verified |
| 2 | sgl-project/sglang | SGLang is a high-performance serving framework for large language models and... | | Verified |
| 3 | alibaba/MNN | MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba,... | | Verified |
| 4 | xorbitsai/inference | Swap GPT for any LLM by changing a single line of code. Xinference lets you... | | Verified |
| 5 | tensorzero/tensorzero | TensorZero is an open-source stack for industrial-grade LLM applications. It... | | Verified |
| 6 | tenstorrent/tt-metal | :metal: TT-NN operator library, and TT-Metalium low level kernel programming model. | | Verified |
| 7 | alibaba/rtp-llm | RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications. | | Verified |
| 8 | jd-opensource/xllm | A high-performance inference engine for LLMs, optimized for diverse AI accelerators. | | Established |
| 9 | gpustack/gpustack | Performance-optimized AI inference on your GPUs. Unlock superior throughput... | | Established |
| 10 | ARahim3/mlx-tune | Bringing the Unsloth experience to Mac users via Apple's MLX framework | | Established |
| 11 | InternLM/lmdeploy | LMDeploy is a toolkit for compressing, deploying, and serving LLMs. | | Established |
| 12 | ModelTC/LightLLM | LightLLM is a Python-based LLM (Large Language Model) inference and serving... | | Established |
| 13 | FastFlowLM/FastFlowLM | Run LLMs on AMD Ryzen™ AI NPUs in minutes. Just like Ollama - but... | | Established |
| 14 | NexaAI/nexa-sdk | Run frontier LLMs and VLMs with day-0 model support across GPU, NPU, and... | | Established |
| 15 | NVIDIA-NeMo/Automodel | PyTorch Distributed native training library for LLMs/VLMs with OOTB Hugging... | | Established |
| 16 | zhihu/ZhiLight | A highly optimized LLM inference acceleration engine for Llama and its variants. | | Established |
| 17 | underneathall/pinferencia | Python + Inference - Model Deployment library in Python. Simplest model... | | Established |
| 18 | ai-decentralized/BloomBee | Decentralized LLM fine-tuning and inference with offloading | | Established |
| 19 | bigscience-workshop/petals | 🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x... | | Established |
| 20 | toverainc/willow-inference-server | Open source, local, and self-hosted highly optimized language inference... | | Established |
| 21 | Tiiny-AI/PowerInfer | High-speed Large Language Model Serving for Local Deployment | | Established |
| 22 | GeeeekExplorer/nano-vllm | Nano vLLM | | Established |
| 23 | livepeer/ai-runner | Inference runtime for running different batch and real-time AI pipelines. | | Established |
| 24 | alibaba/InferSim | A Lightweight LLM Inference Performance Simulator | | Established |
| 25 | microsoft/vidur | A large-scale simulation framework for LLM inference | | Established |
| 26 | zhenye234/LLaSA_training | LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis | | Established |
| 27 | AI-Hypercomputer/JetStream | JetStream is a throughput and memory optimized engine for LLM inference on... | | Established |
| 28 | vitoplantamura/OnnxStream | Lightweight inference library for ONNX files, written in C++. It can run... | | Established |
| 29 | kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference | Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A | | Established |
| 30 | microsoft/sarathi-serve | A low-latency & high-throughput serving engine for LLMs | | Established |
| 31 | Troyanovsky/Local-LLM-Comparison-Colab-UI | Compare the performance of different LLMs that can be deployed locally on... | | Established |
| 32 | jina-ai/rungpt | An open-source, cloud-native serving framework for large multi-modal models (LMMs). | | Established |
| 33 | Deep-Spark/DeepSparkInference | DeepSparkInference has selected 216 inference models of both small and large... | | Emerging |
| 34 | higgsfield-ai/higgsfield | Fault-tolerant, highly scalable GPU orchestration, and a machine learning... | | Emerging |
| 35 | intel/ipex-llm | Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM,... | | Emerging |
| 36 | slwang-ustc/nano-vllm-v1 | Nano vLLM with vLLM v1's request scheduling strategy and chunked prefill | | Emerging |
| 37 | SearchSavior/OpenArc | Inference engine for Intel devices. Serve LLMs, VLMs, Whisper, Kokoro-TTS,... | | Emerging |
| 38 | vectorch-ai/ScaleLLM | A high-performance inference system for large language models, designed for... | | Emerging |
| 39 | bytedance/byteir | A model compilation solution for various hardware | | Emerging |
| 40 | MegEngine/InferLLM | A lightweight LLM model inference framework | | Emerging |
| 41 | RWKV/rwkv.cpp | INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model | | Emerging |
| 42 | inclusionAI/asystem-awex | A high-performance RL training-inference weight synchronization framework,... | | Emerging |
| 43 | powerserve-project/PowerServe | High-speed and easy-to-use LLM serving framework for local deployment | | Emerging |
| 44 | interestingLSY/swiftLLM | A tiny yet powerful LLM inference system tailored for research purposes.... | | Emerging |
| 45 | andrewkchan/deepseek.cpp | CPU inference for the DeepSeek family of large language models in C++ | | Emerging |
| 46 | SqueezeAILab/LLMCompiler | [ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling | | Emerging |
| 47 | 1b5d/llm-api | Run any Large Language Model behind a unified API | | Emerging |
| 48 | AI-Hypercomputer/jetstream-pytorch | PyTorch/XLA integration with JetStream (https://github.com/google/JetStream)... | | Emerging |
| 49 | PureBee/purebee | A GPU defined in software. Runs Llama 3.2 1B at 3.6 tok/sec. Zero dependencies. | | Emerging |
| 50 | modelscope/dash-infer | DashInfer is a native LLM inference engine aiming to deliver... | | Emerging |
| 51 | jankais3r/LLaMA_MPS | Run LLaMA (and Stanford-Alpaca) inference on Apple Silicon GPUs. | | Emerging |
| 52 | Azure99/BlossomData | A fluent, scalable, and easy-to-use LLM data processing framework. | | Emerging |
| 53 | chenmozhijin/BSRoformer.cpp | GGML-based C++ inference for BS Roformer/Mel-Band-Roformer vocal separation... | | Emerging |
| 54 | zejia-lin/BulletServe | Boosting GPU utilization for LLM serving via dynamic spatial-temporal... | | Emerging |
| 55 | aniketmaurya/llm-inference | Large Language Model (LLM) Inference API and Chatbot | | Emerging |
| 56 | James-QiuHaoran/LLM-serving-with-proxy-models | Efficient Interactive LLM Serving with Proxy Model-based Sequence Length... | | Emerging |
| 57 | riccardomusmeci/mlx-llm | Large Language Models (LLMs) applications and tools running on Apple Silicon... | | Emerging |
| 58 | MrYxJ/calculate-flops.pytorch | The calflops is designed to calculate FLOPs, MACs and Parameters in all... | | Emerging |
| 59 | hpcaitech/SwiftInfer | Efficient AI Inference & Serving | | Emerging |
| 60 | argonne-lcf/LLM-Inference-Bench | LLM-Inference-Bench | | Emerging |
| 61 | toyaix/TritonLLM | LLM Inference via Triton (Flexible & Modular): Focused on Kernel... | | Emerging |
| 62 | jdaln/dgx-spark-inference-stack | Serve the home! Inference stack for your Nvidia DGX Spark aka the Grace... | | Emerging |
| 63 | AmpereComputingAI/llama.cpp | Ampere optimized llama.cpp | | Emerging |
| 64 | TrevTron/indiedroid-nova-llm | Running Llama 3.1 8B and other LLMs on RK3588 NPU - benchmarks and setup guides | | Emerging |
| 65 | efeslab/Nanoflow | A throughput-oriented high-performance serving framework for LLMs | | Emerging |
| 66 | thruthseeker/LionLock_FDE_OSS | Open source fatigue detection engine for large language models with trust overlay | | Emerging |
| 67 | knagrecha/saturn | Saturn accelerates the training of large-scale deep learning models with a... | | Emerging |
| 68 | zRzRzRzRzRzRzR/lm-fly | LLM inference framework acceleration: make LLMs fly | | Emerging |
| 69 | CoderLSF/fast-llama | Runs LLaMA with Extremely HIGH speed | | Emerging |
| 70 | rbitr/llm.f90 | LLM inference in Fortran | | Emerging |
| 71 | ShinoharaHare/LLM-Training | A distributed training framework for large language models powered by Lightning. | | Emerging |
| 72 | invergent-ai/surogate | Insanely fast LLM pre-training and fine-tuning for modern NVIDIA GPUs.... | | Emerging |
| 73 | andrewkchan/yalm | Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O | | Emerging |
| 74 | gotzmann/booster | Booster - open accelerator for LLM models. Better inference and debugging... | | Emerging |
| 75 | m-horky/sllm | Tools using small Large Language Models | | Emerging |
| 76 | m0dulo/InferSpore | 🌱 A fully independent Large Language Model (LLM) inference engine, built... | | Emerging |
| 77 | moeru-ai/demodel | 🚀🛸 Easily boost the speed of pulling your models and datasets from various... | | Emerging |
| 78 | alibaba/easydist | Automated Parallelization System and Infrastructure for Multiple Ecosystems | | Emerging |
| 79 | lucasjinreal/Namo-R1 | A CPU Realtime VLM in 500M. Surpassed Moondream2 and SmolVLM. Training from... | | Emerging |
| 80 | nareshis21/Truelarge-RT | Android inference engine running 20B+ parameter LLMs on 4GB-8GB RAM devices.... | | Emerging |
| 81 | vivy-yi/awesome-llm-training-inference | Curated list of LLM training and inference frameworks, tools, and resources.... | | Emerging |
| 82 | RahulSChand/gpu_poor | Calculate token/s & GPU memory requirement for any LLM. Supports... | | Emerging |
| 83 | yingding/applyllm | A Python package for applying LLMs with LangChain and Hugging Face on local... | | Emerging |
| 84 | gunnarnordqvist/opencode-context-filter | Transparent HTTP proxy that automatically filters repository context for... | | Emerging |
| 85 | AshishGautamX/K8s-LLM-Scheduler | An intelligent Kubernetes scheduler powered by Meta's Llama-3.3-70B model... | | Emerging |
| 86 | psmarter/mini-infer | A high-performance LLM inference engine with PagedAttention \|... | | Emerging |
| 87 | winstxnhdw/llm-api | A fast CPU-based API for Qwen 2.5 using CTranslate2, hosted on Hugging Face Spaces. | | Emerging |
| 88 | dengls24/LLM-para | Analyze LLM inference: FLOPs, memory, Roofline model. Supports GQA, MoE,... | | Emerging |
| 89 | kennethleungty/DeepSeek-R1-Ollama-Simple-Evals | Run and Evaluate DeepSeek-R1 Distilled Models Locally with Ollama and... | | Emerging |
| 90 | HyperMink/inferenceable | Scalable AI Inference Server for CPU and GPU with Node.js \| Utilizes... | | Emerging |
| 91 | tommasocerruti/detllm | Deterministic-mode checks for LLM inference: measure run/batch variance,... | | Emerging |
| 92 | ybubnov/metalchat | Pure C++23 Llama inference for Apple Silicon chips | | Emerging |
| 93 | Relaxed-System-Lab/HexGen | [ICML 2024] Serving LLMs on heterogeneous decentralized clusters. | | Emerging |
| 94 | titanml/takeoff-community | TitanML Takeoff Server is an optimization, compression and deployment... | | Emerging |
| 95 | bpevangelista/vfastml | Inference and Training Engine for LLMs, Image2Image and Other Models | | Emerging |
| 96 | KevinLee1110/dynamic-batching | The official repo for the paper "Optimizing LLM Inference Throughput via... | | Emerging |
| 97 | harleyszhang/llm_counts | LLM theoretical performance analysis tools supporting params, FLOPs, memory... | | Emerging |
| 98 | ToddThomson/Mila | Achilles Mila Deep Neural Network library provides a comprehensive API to... | | Emerging |
| 99 | VPanjeta/PyLLaMa-CPU | Fast LLaMa inference on CPU using llama.cpp for Python | | Experimental |
| 100 | BenChaliah/NVFP4-on-4090-vLLM | AdaLLM is an NVFP4-first inference runtime for Ada Lovelace (RTX 4090) with... | | Experimental |
| 101 | changwoolee/BLAST | [NeurIPS 2024] BLAST: Block Level Adaptive Structured Matrix for Efficient... | | Experimental |
| 102 | KarthikSriramGit/H.E.I.M.D.A.L.L | H.E.I.M.D.A.L.L looks at fleet telemetry and gives you natural-language... | | Experimental |
| 103 | modelize-ai/LLM-Inference-Deployment-Tutorial | Tutorial for LLM developers about engine design, service deployment,... | | Experimental |
| 104 | jmaczan/tiny-vllm | High performance LLM inference engine, a younger sibling of vLLM | | Experimental |
| 105 | datvodinh/serve-llm | Serve high throughput and scalable LLM using Ray and vLLM | | Experimental |
| 106 | dwain-barnes/LLM-GGUF-Auto-Converter | Automated Jupyter notebook solution for batch converting Large Language... | | Experimental |
| 107 | HelpingAI/inferno | Run Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and other... | | Experimental |
| 108 | EmbeddedLLM/embeddedllm | EmbeddedLLM: API server for Embedded Device Deployment. Currently supports... | | Experimental |
| 109 | nitrictech/pycasts | A text-to-podcast inference API | | Experimental |
| 110 | tensorchord/inference-benchmark | Benchmark for machine learning model online serving (LLM, embedding,... | | Experimental |
| 111 | ictnlp/SiLLM | SiLLM is a Simultaneous Machine Translation (SiMT) Framework. It utilizes a... | | Experimental |
| 112 | mjglatzmaier/llm-boostrap | Starter repo for running local LLM inference and lightweight benchmarking on... | | Experimental |
| 113 | adamydwang/mobilellama | A lightweight C++ LLaMA inference engine for mobile devices | | Experimental |
| 114 | rafaelmaza/llmfit-web | Find the best open-source LLM for your GPU/RAM - fit, speed & quality... | | Experimental |
| 115 | AMD-AGI/gpt-fast | GPT-Fast for Multimodal Models on AMD GPUs | | Experimental |
| 116 | deepagency/llm-resource-planner | A simple CLI tool to fetch Hugging Face model metadata and estimate required... | | Experimental |
| 117 | AntonioVFranco/elamonica | Production-ready test-time compute optimization framework for LLM inference.... | | Experimental |
| 118 | quantumnic/ssd-llm | Run 70B+ LLMs on Apple Silicon by using SSD as extended memory — intelligent... | | Experimental |
| 119 | TeamADAPT/blitzkernels | BlitzKernels — production WASM inference kernels for edge AI (embedding,... | | Experimental |
| 120 | llm-works/llm-infer | LLM inference server with native, vLLM, and Ollama backends, including a... | | Experimental |
| 121 | iNeil77/vllm-code-harness | Run code inference-only benchmarks quickly using vLLM | | Experimental |
| 122 | GPUforLLM/llm-vram-calculator | Accurate VRAM calculator for Local LLMs (Llama 4, DeepSeek V3, Qwen 2.5).... | | Experimental |
| 123 | NEBUL-AI/HF-VRAM-Extension | VRAM calculator for Hugging Face models | | Experimental |
| 124 | CornelisKuijpers/SIP-interface | Run 400B+ parameter AI models on consumer hardware with 12GB RAM | | Experimental |
| 125 | landry-some/LLM-streaming | Efficient streaming inference for large language models (LLMs). | | Experimental |
| 126 | liam8421/faster-llm | 🚀 Accelerate LLM training with Fast-LLM, an open-source library for... | | Experimental |
| 127 | onlychara553-debug/dgx-spark-inference-stack | 🚀 Serve large language models efficiently at home with this Docker-based... | | Experimental |
| 128 | MonitooDev/indiedroid-nova-llm | 🚀 Benchmark local LLMs like Llama 3.1 on the Indiedroid Nova with RK3588... | | Experimental |
| 129 | isshiki-dev/docker-model-runner | Self-hosted Anthropic API Compatible Inference Server with Claude Code... | | Experimental |
| 130 | X-rayLaser/DistributedLLM | Run LLM inference by splitting models into parts and hosting each part on a... | | Experimental |
| 131 | arkodeepsen/helix | Professional training stack for 100M parameter language models optimized for... | | Experimental |
| 132 | getflexai/flex_ai | Simplifies fine-tuning and inference for 60+ open-source LLMs through a single API | | Experimental |
| 133 | eniompw/llama-cpp-gpu | Load larger models by offloading model layers to both GPU and CPU | | Experimental |
| 134 | ThalesMMS/sglang-config | Configuration files and deployment scripts for serving Llama 3.2 3B and Qwen... | | Experimental |
| 135 | Artemarius/CuInfer | From-scratch LLM inference engine in C++17/CUDA. Custom kernels, GGUF model... | | Experimental |
| 136 | johnbrodowski/AutoInferenceBenchmark | AutoInferenceBenchmark is a Windows desktop application for evaluating and... | | Experimental |
| 137 | EvanZhuang/rocm_tips | Tips for building and using DL packages for AMD ROCm | | Experimental |
| 138 | di-osc/osc-llm | Lightweight LLM inference engine | | Experimental |
| 139 | Scieries-Reunies-de-l-Est/llm | LLM deployment API of the Service Commercial company. | | Experimental |
| 140 | darxkies/cpu-slm | A holiday project to better understand the inner workings of SLMs/LLMs. | | Experimental |
| 141 | virtualramblas/DFloat11_MPS | DFloat11 for Apple Silicon. | | Experimental |
| 142 | KT313/assistant_base | A custom framework for easy use of LLMs, VLMs, etc., supporting various modes... | | Experimental |
| 143 | Alexyskoutnev/TurboInference | Welcome to TurboInference, a high-performance inference toolkit written in... | | Experimental |
| 144 | piotrmaciejbednarski/llm-inference-tampering | Proof-of-concept for persistent manipulation of LLM outputs by modifying... | | Experimental |
| 145 | Meahg/exvllm | 🚀 Enhance vLLM with exvllm to utilize MoE mixed inference, enabling... | | Experimental |
| 146 | nikelborm/amd-amdgpu-rocm-ollama-gfx90c-ati-radeon-vega-ryzen7-5800H-arch-linux | Run Ollama on AMD Ryzen 7 5800H CPU with integrated GPU AMD ATI Radeon Vega... | | Experimental |
| 147 | SunayHegde2006/Air.rs | Air.rs: 70B+ inference on consumer GPUs; LLM inference in Rust | | Experimental |
| 148 | 1337hero/rx7900xtx-llama-bench-rocm | Benchmark script for llama.cpp & results for AMD RX 7900 XTX | | Experimental |
| 149 | rajatady/Inference-Stack | Production-grade LLM inference API built from scratch. NestJS gateway +... | | Experimental |
| 150 | soy-tuber/localllama-insights | Technical insights from r/LocalLLaMA — vLLM, FP8, NVFP4, Blackwell GPU... | | Experimental |
| 151 | Pyrolignic-paydirt84/pse-vcipher-collapse | Accelerate LLM inference by collapsing attention paths with... | | Experimental |
| 152 | rick97julho/do-i-have-the-vram | 🔍 Estimate your VRAM needs for Hugging Face models in seconds without... | | Experimental |
| 153 | rinoScremin/Open_Cluster_AI_Station_beta | High-performance distributed matrix computation for AI workloads. Supports... | | Experimental |
| 154 | vishvaRam/Docker-vLLM-Server-Builder-Runpod | Production-grade, OpenAI-compatible server using vLLM v0.17.0. Deploy LLMs,... | | Experimental |
| 155 | karun2328/llm_serving_benchmarks | Benchmarking LLM inference serving with vLLM, analyzing latency, throughput,... | | Experimental |
| 156 | virtualramblas/FlexLLMGenMPS | Running large language models on a single M1/M2 GPU for throughput-oriented... | | Experimental |
| 157 | joeddav/illustrated-training-cluster | [WIP] Interactive visualization of LLM training parallelism across GPU clusters | | Experimental |
| 158 | ZeeetOne/llm-inference-deployment | Practical example of deploying fine-tuned LLMs locally with FastAPI.... | | Experimental |
| 159 | G-B-KEVIN-ARJUN/runtime-inference | "Faster AI: Accelerating Qwen 2.5 from 7 t/s to 82 t/s on a single RTX 4060... | | Experimental |
| 160 | biraj21/llm-server-from-scratch | FastAPI server for locally serving Gemma 3 270M & OpenAI Whisper with... | | Experimental |
| 161 | adithya-s-k/LLM-InferenceNet | LLM InferenceNet is a C++ project designed to facilitate fast and efficient... | | Experimental |
| 162 | keisuke-okb/llm-tokenwise-inference | Token-wise, real-time display inference module for Llama2 and other LLMs. | | Experimental |
| 163 | dae9999nam/LLM_C | This repository is to optimize the throughput of Large Language Model... | | Experimental |
| 164 | hades255/benchmarking-llama_install-on-modular_max | Benchmark Inference Stack (vLLM vs Modular/MAX) | | Experimental |