LLM Inference Engines

High-performance inference frameworks and engines optimized for deploying and serving LLMs efficiently across various hardware accelerators and resource-constrained devices. Does NOT include LLM training frameworks, fine-tuning tools, or application-level chatbot/UI wrappers.

There are 29 LLM inference engine tools tracked. One scores above 70 (Verified tier). The highest-rated is vllm-project/vllm-ascend at 73/100 with 1,773 stars. Six of the top 10 are actively maintained.

Get all 29 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-inference-engines&limit=29"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
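A minimal sketch of working with the endpoint's JSON in Python. The actual response schema is not documented here, so the field names (`name`, `score`, `tier`) and the sample records are assumptions for illustration only:

```python
import json
from collections import Counter

# Hypothetical sample mirroring the assumed response shape;
# the real API's field names may differ.
sample = json.loads("""
[
  {"name": "vllm-project/vllm-ascend", "score": 73, "tier": "Verified"},
  {"name": "kvcache-ai/Mooncake", "score": 69, "tier": "Established"},
  {"name": "bd4sur/Nano", "score": 38, "tier": "Emerging"}
]
""")

def tier_counts(projects):
    """Count how many projects fall into each quality tier."""
    return Counter(p["tier"] for p in projects)

def top_by_score(projects, n=1):
    """Return the n highest-scoring projects, best first."""
    return sorted(projects, key=lambda p: p["score"], reverse=True)[:n]

print(tier_counts(sample))
print(top_by_score(sample)[0]["name"])
```

To run against the live endpoint, replace `sample` with the parsed body of the curl request above (e.g. via `urllib.request` or `requests`).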

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | vllm-project/vllm-ascend | Community maintained hardware plugin for vLLM on Ascend | 73 | Verified |
| 2 | SemiAnalysisAI/InferenceX | Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS -... | 69 | Established |
| 3 | kvcache-ai/Mooncake | Mooncake is the serving platform for Kimi, a leading LLM service provided by... | 69 | Established |
| 4 | uccl-project/uccl | UCCL is an efficient communication library for GPUs, covering collectives,... | 68 | Established |
| 5 | sophgo/tpu-mlir | Machine learning compiler based on MLIR for Sophgo TPU. | 68 | Established |
| 6 | BBuf/how-to-optim-algorithm-in-cuda | How to optimize some algorithms in CUDA. | 55 | Established |
| 7 | RightNow-AI/picolm | Run a 1-billion parameter LLM on a $10 board with 256MB RAM | 52 | Established |
| 8 | jinbooooom/ai-infra-hpc | HPC tutorials covering collective communication (MPI, NCCL), CUDA programming, vectorized SIMD, RDMA communication, and more | 51 | Established |
| 9 | zjhellofss/KuiperLLama | Hands-on project that walks you through building, from scratch, an LLM inference framework supporting LLama2/3 and Qwen2.5; well suited for campus-recruiting and internship portfolios | 49 | Emerging |
| 10 | RayFernando1337/LLM-Calc | Instantly calculate the maximum size of quantized language models that can... | 46 | Emerging |
| 11 | erans/selfhostllm | A web-based calculator for estimating GPU memory requirements and maximum... | 42 | Emerging |
| 12 | amirgholami/ai_and_memory_wall | AI and Memory Wall | 42 | Emerging |
| 13 | bd4sur/Nano | Electronic parrot / toy language model | 38 | Emerging |
| 14 | ChiefGyk3D/FrankenLLM | Stitched-together GPUs, but it lives! Run different LLM models optimally... | 36 | Emerging |
| 15 | FilipFan/PolyEngineInfer | Run LLM inference in an Android app with llama.cpp, ExecuTorch, LiteRT,... | 34 | Emerging |
| 16 | Alex188dot/GPU-VRAM-Calculator | A simple tool to find out GPU VRAM requirements for running LLMs | 31 | Emerging |
| 17 | refinefuture-ai/refft.cpp | A new approach of running LLM/LMs' inference/training on GPU/NPU backends... | 27 | Experimental |
| 18 | hofong428/Optimizing-GPU-Kernels | LLM Serving & Inference Optimization | 25 | Experimental |
| 19 | PrajwalNeeralagi/nano-vllm | 🚀 Implement fast offline inference with Nano-vLLM, a lightweight and... | 24 | Experimental |
| 20 | George614/gpu-mem-calculator | GPU Memory Calculator for LLM Training - Calculate GPU memory requirements... | 24 | Experimental |
| 21 | darekhta/marmot | High-performance LLM inference engine in C23 with CPU and Metal backends,... | 22 | Experimental |
| 22 | r3tr056/loc-ai-ly | Locaily - Making Large Language Model Inference Accessible on Consumer Hardware | 22 | Experimental |
| 23 | manishklach/SRMIC_X1 | Analytical simulator for SRMIC, a residency-first LLM inference accelerator... | 22 | Experimental |
| 24 | simar-rekhi/triton | LLM-assisted compiler pass generation with Triton & CUDA | 22 | Experimental |
| 25 | Jugurthakebaili1/vLLM-Kunlun | 🛠 Enhance vLLM performance on Kunlun XPU with this hardware plugin, offering... | 21 | Experimental |
| 26 | elibutters/CascadeInference | Cascade-based inference for LLMs | 13 | Experimental |
| 27 | LessUp/hetero-paged-infer | PagedAttention + Continuous Batching Inference Engine Prototype (Rust):... | 13 | Experimental |
| 28 | MetaxisResearch/parallax | Distributed inference across heterogeneous hardware. | 11 | Experimental |
| 29 | rahulunair/xpu_tgi | TGI server setup for Intel Data Centre GPUs | 11 | Experimental |