LLM Inference Engines: Transformer Models
Optimized inference engines and serving systems for deploying and running large language models efficiently. Focuses on throughput, latency, memory optimization, and production deployment. Does NOT include training frameworks, fine-tuning methods, quantization techniques, or model architecture implementations.
We track 164 LLM inference engine projects. Seven score above 70 (the verified tier). The highest-rated is vllm-project/vllm at 87/100 with 73,007 stars. All of the top 10 are actively maintained.
Get all 164 projects as JSON (the example below requests the first 20; raise `limit` to fetch the full set):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-inference-engines&limit=20"
Open to everyone: anonymous access allows 100 requests/day with no key. A free key raises the limit to 1,000 requests/day.
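The same endpoint can be queried from code. The sketch below builds the query URL and filters the returned projects by tier; note that the `name` and `tier` keys in the response are assumptions about the JSON schema (the API's field names are not documented here), so inspect a real payload before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the dataset query URL with URL-encoded parameters."""
    params = {"domain": domain, "subcategory": subcategory, "limit": limit}
    return f"{BASE}?{urlencode(params)}"

def filter_by_tier(projects: list[dict], tier: str) -> list[str]:
    """Return project names in the given tier.

    NOTE: 'name' and 'tier' are assumed response field names,
    not documented guarantees of this API.
    """
    return [p["name"] for p in projects if p.get("tier") == tier]

url = build_url("transformers", "llm-inference-engines", limit=164)

# Live call (counts against the 100 requests/day anonymous quota):
#   with urlopen(url) as resp:
#       projects = json.load(resp)

# Offline demo on a hand-made sample shaped like the assumed schema:
sample = [
    {"name": "vllm-project/vllm", "tier": "Verified", "score": 87},
    {"name": "InternLM/lmdeploy", "tier": "Established"},
]
print(filter_by_tier(sample, "Verified"))  # prints ['vllm-project/vllm']
```

The helpers are deliberately split so the URL construction can be reused for other domain/subcategory pairs without touching the filtering logic.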
| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | vllm-project/vllm | A high-throughput and memory-efficient inference and serving engine for LLMs | | Verified |
| 2 | sgl-project/sglang | SGLang is a high-performance serving framework for large language models and... | | Verified |
| 3 | alibaba/MNN | MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba,... | | Verified |
| 4 | xorbitsai/inference | Swap GPT for any LLM by changing a single line of code. Xinference lets you... | | Verified |
| 5 | tensorzero/tensorzero | TensorZero is an open-source stack for industrial-grade LLM applications. It... | | Verified |
| 6 | tenstorrent/tt-metal | :metal: TT-NN operator library, and TT-Metalium low level kernel programming model. | | Verified |
| 7 | alibaba/rtp-llm | RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications. | | Verified |
| 8 | jd-opensource/xllm | A high-performance inference engine for LLMs, optimized for diverse AI accelerators. | | Established |
| 9 | gpustack/gpustack | Performance-optimized AI inference on your GPUs. Unlock superior throughput... | | Established |
| 10 | ARahim3/mlx-tune | Bringing the Unsloth experience to Mac users via Apple's MLX framework | | Established |
| 11 | InternLM/lmdeploy | LMDeploy is a toolkit for compressing, deploying, and serving LLMs. | | Established |
| 12 | ModelTC/LightLLM | LightLLM is a Python-based LLM (Large Language Model) inference and serving... | | Established |
| 13 | FastFlowLM/FastFlowLM | Run LLMs on AMD Ryzen™ AI NPUs in minutes. Just like Ollama - but... | | Established |
| 14 | NexaAI/nexa-sdk | Run frontier LLMs and VLMs with day-0 model support across GPU, NPU, and... | | Established |
| 15 | NVIDIA-NeMo/Automodel | PyTorch Distributed native training library for LLMs/VLMs with OOTB Hugging... | | Established |
| 16 | zhihu/ZhiLight | A highly optimized LLM inference acceleration engine for Llama and its variants. | | Established |
| 17 | underneathall/pinferencia | Python + Inference - Model Deployment library in Python. Simplest model... | | Established |
| 18 | ai-decentralized/BloomBee | Decentralized LLM fine-tuning and inference with offloading | | Established |
| 19 | bigscience-workshop/petals | 🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x... | | Established |
| 20 | toverainc/willow-inference-server | Open source, local, and self-hosted highly optimized language inference... | | Established |
| 21 | Tiiny-AI/PowerInfer | High-speed Large Language Model Serving for Local Deployment | | Established |
| 22 | GeeeekExplorer/nano-vllm | Nano vLLM | | Established |
| 23 | livepeer/ai-runner | Inference runtime for running different batch and real-time AI pipelines. | | Established |
| 24 | alibaba/InferSim | A Lightweight LLM Inference Performance Simulator | | Established |
| 25 | microsoft/vidur | A large-scale simulation framework for LLM inference | | Established |
| 26 | zhenye234/LLaSA_training | LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis | | Established |
| 27 | AI-Hypercomputer/JetStream | JetStream is a throughput and memory optimized engine for LLM inference on... | | Established |
| 28 | vitoplantamura/OnnxStream | Lightweight inference library for ONNX files, written in C++. It can run... | | Established |
| 29 | kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference | Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A | | Established |
| 30 | microsoft/sarathi-serve | A low-latency & high-throughput serving engine for LLMs | | Established |
| 31 | Troyanovsky/Local-LLM-Comparison-Colab-UI | Compare the performance of different LLMs that can be deployed locally on... | | Established |
| 32 | jina-ai/rungpt | An open-source, cloud-native serving framework for large multi-modal models (LMMs). | | Established |
| 33 | Deep-Spark/DeepSparkInference | DeepSparkInference has selected 216 inference models of both small and large... | | Emerging |
| 34 | higgsfield-ai/higgsfield | Fault-tolerant, highly scalable GPU orchestration, and a machine learning... | | Emerging |
| 35 | intel/ipex-llm | Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM,... | | Emerging |
| 36 | slwang-ustc/nano-vllm-v1 | Nano vLLM with vLLM v1's request scheduling strategy and chunked prefill | | Emerging |
| 37 | SearchSavior/OpenArc | Inference engine for Intel devices. Serve LLMs, VLMs, Whisper, Kokoro-TTS,... | | Emerging |
| 38 | vectorch-ai/ScaleLLM | A high-performance inference system for large language models, designed for... | | Emerging |
| 39 | bytedance/byteir | A model compilation solution for various hardware | | Emerging |
| 40 | MegEngine/InferLLM | A lightweight LLM model inference framework | | Emerging |
| 41 | RWKV/rwkv.cpp | INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model | | Emerging |
| 42 | inclusionAI/asystem-awex | A high-performance RL training-inference weight synchronization framework,... | | Emerging |
| 43 | powerserve-project/PowerServe | High-speed and easy-to-use LLM serving framework for local deployment | | Emerging |
| 44 | interestingLSY/swiftLLM | A tiny yet powerful LLM inference system tailored for research purposes.... | | Emerging |
| 45 | andrewkchan/deepseek.cpp | CPU inference for the DeepSeek family of large language models in C++ | | Emerging |
| 46 | SqueezeAILab/LLMCompiler | [ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling | | Emerging |
| 47 | 1b5d/llm-api | Run any Large Language Model behind a unified API | | Emerging |
| 48 | AI-Hypercomputer/jetstream-pytorch | PyTorch/XLA integration with JetStream (https://github.com/google/JetStream)... | | Emerging |
| 49 | PureBee/purebee | A GPU defined in software. Runs Llama 3.2 1B at 3.6 tok/sec. Zero dependencies. | | Emerging |
| 50 | modelscope/dash-infer | DashInfer is a native LLM inference engine aiming to deliver... | | Emerging |
| 51 | jankais3r/LLaMA_MPS | Run LLaMA (and Stanford-Alpaca) inference on Apple Silicon GPUs. | | Emerging |
| 52 | Azure99/BlossomData | A fluent, scalable, and easy-to-use LLM data processing framework. | | Emerging |
| 53 | chenmozhijin/BSRoformer.cpp | GGML-based C++ inference for BS Roformer/Mel-Band-Roformer vocal separation... | | Emerging |
| 54 | zejia-lin/BulletServe | Boosting GPU utilization for LLM serving via dynamic spatial-temporal... | | Emerging |
| 55 | aniketmaurya/llm-inference | Large Language Model (LLM) Inference API and Chatbot | | Emerging |
| 56 | James-QiuHaoran/LLM-serving-with-proxy-models | Efficient Interactive LLM Serving with Proxy Model-based Sequence Length... | | Emerging |
| 57 | riccardomusmeci/mlx-llm | Large Language Models (LLMs) applications and tools running on Apple Silicon... | | Emerging |
| 58 | MrYxJ/calculate-flops.pytorch | The calflops is designed to calculate FLOPs, MACs and Parameters in all... | | Emerging |
| 59 | hpcaitech/SwiftInfer | Efficient AI Inference & Serving | | Emerging |
| 60 | argonne-lcf/LLM-Inference-Bench | LLM-Inference-Bench | | Emerging |
| 61 | toyaix/TritonLLM | LLM Inference via Triton (Flexible & Modular): Focused on Kernel... | | Emerging |
| 62 | jdaln/dgx-spark-inference-stack | Serve the home! Inference stack for your Nvidia DGX Spark aka the Grace... | | Emerging |
| 63 | AmpereComputingAI/llama.cpp | Ampere optimized llama.cpp | | Emerging |
| 64 | TrevTron/indiedroid-nova-llm | Running Llama 3.1 8B and other LLMs on RK3588 NPU - benchmarks and setup guides | | Emerging |
| 65 | efeslab/Nanoflow | A throughput-oriented high-performance serving framework for LLMs | | Emerging |
| 66 | thruthseeker/LionLock_FDE_OSS | Open source fatigue detection engine for large language models with trust overlay | | Emerging |
| 67 | knagrecha/saturn | Saturn accelerates the training of large-scale deep learning models with a... | | Emerging |
| 68 | zRzRzRzRzRzRzR/lm-fly | LLM inference framework acceleration: make LLMs fly | | Emerging |
| 69 | CoderLSF/fast-llama | Runs LLaMA with Extremely HIGH speed | | Emerging |
| 70 | rbitr/llm.f90 | LLM inference in Fortran | | Emerging |
| 71 | ShinoharaHare/LLM-Training | A distributed training framework for large language models powered by Lightning. | | Emerging |
| 72 | invergent-ai/surogate | Insanely fast LLM pre-training and fine-tuning for modern NVIDIA GPUs.... | | Emerging |
| 73 | andrewkchan/yalm | Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O | | Emerging |
| 74 | gotzmann/booster | Booster - open accelerator for LLM models. Better inference and debugging... | | Emerging |
| 75 | m-horky/sllm | Tools using small Large Language Models | | Emerging |
| 76 | m0dulo/InferSpore | 🌱 A fully independent Large Language Model (LLM) inference engine, built... | | Emerging |
| 77 | moeru-ai/demodel | 🚀🛸 Easily boost the speed of pulling your models and datasets from various... | | Emerging |
| 78 | alibaba/easydist | Automated Parallelization System and Infrastructure for Multiple Ecosystems | | Emerging |
| 79 | lucasjinreal/Namo-R1 | A CPU Realtime VLM in 500M. Surpassed Moondream2 and SmolVLM. Training from... | | Emerging |
| 80 | nareshis21/Truelarge-RT | Android inference engine running 20B+ parameter LLMs on 4GB-8GB RAM devices.... | | Emerging |
| 81 | vivy-yi/awesome-llm-training-inference | Curated list of LLM training and inference frameworks, tools, and resources.... | | Emerging |
| 82 | RahulSChand/gpu_poor | Calculate token/s & GPU memory requirement for any LLM. Supports... | | Emerging |
| 83 | yingding/applyllm | A Python package for applying LLMs with LangChain and Hugging Face on local... | | Emerging |
| 84 | gunnarnordqvist/opencode-context-filter | Transparent HTTP proxy that automatically filters repository context for... | | Emerging |
| 85 | AshishGautamX/K8s-LLM-Scheduler | An intelligent Kubernetes scheduler powered by Meta's Llama-3.3-70B model... | | Emerging |
| 86 | psmarter/mini-infer | A high-performance LLM inference engine with PagedAttention \|... | | Emerging |
| 87 | winstxnhdw/llm-api | A fast CPU-based API for Qwen 2.5 using CTranslate2, hosted on Hugging Face Spaces. | | Emerging |
| 88 | dengls24/LLM-para | Analyze LLM inference: FLOPs, memory, Roofline model. Supports GQA, MoE,... | | Emerging |
| 89 | kennethleungty/DeepSeek-R1-Ollama-Simple-Evals | Run and Evaluate DeepSeek-R1 Distilled Models Locally with Ollama and... | | Emerging |
| 90 | HyperMink/inferenceable | Scalable AI Inference Server for CPU and GPU with Node.js \| Utilizes... | | Emerging |
| 91 | tommasocerruti/detllm | Deterministic-mode checks for LLM inference: measure run/batch variance,... | | Emerging |
| 92 | ybubnov/metalchat | Pure C++23 Llama inference for Apple Silicon chips | | Emerging |
| 93 | Relaxed-System-Lab/HexGen | [ICML 2024] Serving LLMs on heterogeneous decentralized clusters. | | Emerging |
| 94 | titanml/takeoff-community | TitanML Takeoff Server is an optimization, compression and deployment... | | Emerging |
| 95 | bpevangelista/vfastml | Inference and Training Engine for LLMs, Image2Image and Other Models | | Emerging |
| 96 | KevinLee1110/dynamic-batching | The official repo for the paper "Optimizing LLM Inference Throughput via... | | Emerging |
| 97 | harleyszhang/llm_counts | LLM theoretical performance analysis tools supporting params, FLOPs, memory... | | Emerging |
| 98 | ToddThomson/Mila | Achilles Mila Deep Neural Network library provides a comprehensive API to... | | Emerging |
| 99 | VPanjeta/PyLLaMa-CPU | Fast LLaMa inference on CPU using llama.cpp for Python | | Experimental |
| 100 | BenChaliah/NVFP4-on-4090-vLLM | AdaLLM is an NVFP4-first inference runtime for Ada Lovelace (RTX 4090) with... | | Experimental |
| 101 | changwoolee/BLAST | [NeurIPS 2024] BLAST: Block Level Adaptive Structured Matrix for Efficient... | | Experimental |
| 102 | KarthikSriramGit/H.E.I.M.D.A.L.L | H.E.I.M.D.A.L.L looks at fleet telemetry and gives you natural-language... | | Experimental |
| 103 | modelize-ai/LLM-Inference-Deployment-Tutorial | Tutorial for LLM developers about engine design, service deployment,... | | Experimental |
| 104 | jmaczan/tiny-vllm | High performance LLM inference engine, a younger sibling of vLLM | | Experimental |
| 105 | datvodinh/serve-llm | Serve high throughput and scalable LLM using Ray and vLLM | | Experimental |
| 106 | dwain-barnes/LLM-GGUF-Auto-Converter | Automated Jupyter notebook solution for batch converting Large Language... | | Experimental |
| 107 | HelpingAI/inferno | Run Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and other... | | Experimental |
| 108 | EmbeddedLLM/embeddedllm | EmbeddedLLM: API server for Embedded Device Deployment. Currently supports... | | Experimental |
| 109 | nitrictech/pycasts | A text-to-podcast inference API | | Experimental |
| 110 | tensorchord/inference-benchmark | Benchmark for machine learning model online serving (LLM, embedding,... | | Experimental |
| 111 | ictnlp/SiLLM | SiLLM is a Simultaneous Machine Translation (SiMT) Framework. It utilizes a... | | Experimental |
| 112 | mjglatzmaier/llm-boostrap | Starter repo for running local LLM inference and lightweight benchmarking on... | | Experimental |
| 113 | adamydwang/mobilellama | A lightweight C++ LLaMA inference engine for mobile devices | | Experimental |
| 114 | rafaelmaza/llmfit-web | Find the best open-source LLM for your GPU/RAM - fit, speed & quality... | | Experimental |
| 115 | AMD-AGI/gpt-fast | GPT-Fast for Multimodal Models on AMD GPUs | | Experimental |
| 116 | deepagency/llm-resource-planner | A simple CLI tool to fetch Hugging Face model metadata and estimate required... | | Experimental |
| 117 | AntonioVFranco/elamonica | Production-ready test-time compute optimization framework for LLM inference.... | | Experimental |
| 118 | quantumnic/ssd-llm | Run 70B+ LLMs on Apple Silicon by using SSD as extended memory — intelligent... | | Experimental |
| 119 | TeamADAPT/blitzkernels | BlitzKernels — production WASM inference kernels for edge AI (embedding,... | | Experimental |
| 120 | llm-works/llm-infer | LLM inference server with native, vLLM, and Ollama backends, including a... | | Experimental |
| 121 | iNeil77/vllm-code-harness | Run code inference-only benchmarks quickly using vLLM | | Experimental |
| 122 | GPUforLLM/llm-vram-calculator | Accurate VRAM calculator for Local LLMs (Llama 4, DeepSeek V3, Qwen 2.5).... | | Experimental |
| 123 | NEBUL-AI/HF-VRAM-Extension | VRAM calculator for Hugging Face models | | Experimental |
| 124 | CornelisKuijpers/SIP-interface | Run 400B+ parameter AI models on consumer hardware with 12GB RAM | | Experimental |
| 125 | landry-some/LLM-streaming | Efficient streaming inference for large language models (LLMs). | | Experimental |
| 126 | liam8421/faster-llm | 🚀 Accelerate LLM training with Fast-LLM, an open-source library for... | | Experimental |
| 127 | onlychara553-debug/dgx-spark-inference-stack | 🚀 Serve large language models efficiently at home with this Docker-based... | | Experimental |
| 128 | MonitooDev/indiedroid-nova-llm | 🚀 Benchmark local LLMs like Llama 3.1 on the Indiedroid Nova with RK3588... | | Experimental |
| 129 | isshiki-dev/docker-model-runner | Self-hosted Anthropic API Compatible Inference Server with Claude Code... | | Experimental |
| 130 | X-rayLaser/DistributedLLM | Run LLM inference by splitting models into parts and hosting each part on a... | | Experimental |
| 131 | arkodeepsen/helix | Professional training stack for 100M parameter language models optimized for... | | Experimental |
| 132 | getflexai/flex_ai | Simplifies fine-tuning and inference for 60+ open-source LLMs through a single API | | Experimental |
| 133 | eniompw/llama-cpp-gpu | Load larger models by offloading model layers to both GPU and CPU | | Experimental |
| 134 | ThalesMMS/sglang-config | Configuration files and deployment scripts for serving Llama 3.2 3B and Qwen... | | Experimental |
| 135 | Artemarius/CuInfer | From-scratch LLM inference engine in C++17/CUDA. Custom kernels, GGUF model... | | Experimental |
| 136 | johnbrodowski/AutoInferenceBenchmark | AutoInferenceBenchmark is a Windows desktop application for evaluating and... | | Experimental |
| 137 | EvanZhuang/rocm_tips | Tips for building and using DL packages for AMD ROCm | | Experimental |
| 138 | di-osc/osc-llm | Lightweight LLM inference engine | | Experimental |
| 139 | Scieries-Reunies-de-l-Est/llm | LLM deployment API of the Service Commercial company. | | Experimental |
| 140 | darxkies/cpu-slm | A holiday project to better understand the inner workings of SLMs/LLMs. | | Experimental |
| 141 | virtualramblas/DFloat11_MPS | DFloat11 for Apple Silicon. | | Experimental |
| 142 | KT313/assistant_base | A custom framework for easy use of LLMs, VLMs, etc., supporting various modes... | | Experimental |
| 143 | Alexyskoutnev/TurboInference | Welcome to TurboInference, a high-performance inference toolkit written in... | | Experimental |
| 144 | piotrmaciejbednarski/llm-inference-tampering | Proof-of-concept for persistent manipulation of LLM outputs by modifying... | | Experimental |
| 145 | Meahg/exvllm | 🚀 Enhance vLLM with exvllm to utilize MoE mixed inference, enabling... | | Experimental |
| 146 | nikelborm/amd-amdgpu-rocm-ollama-gfx90c-ati-radeon-vega-ryzen7-5800H-arch-linux | Run Ollama on AMD Ryzen 7 5800H CPU with integrated GPU AMD ATI Radeon Vega... | | Experimental |
| 147 | SunayHegde2006/Air.rs | Air.rs: 70B+ inference on consumer GPUs; LLM inference in Rust | | Experimental |
| 148 | 1337hero/rx7900xtx-llama-bench-rocm | Benchmark script for llama.cpp & results for AMD RX 7900 XTX | | Experimental |
| 149 | rajatady/Inference-Stack | Production-grade LLM inference API built from scratch. NestJS gateway +... | | Experimental |
| 150 | soy-tuber/localllama-insights | Technical insights from r/LocalLLaMA — vLLM, FP8, NVFP4, Blackwell GPU... | | Experimental |
| 151 | Pyrolignic-paydirt84/pse-vcipher-collapse | Accelerate LLM inference by collapsing attention paths with... | | Experimental |
| 152 | rick97julho/do-i-have-the-vram | 🔍 Estimate your VRAM needs for Hugging Face models in seconds without... | | Experimental |
| 153 | rinoScremin/Open_Cluster_AI_Station_beta | High-performance distributed matrix computation for AI workloads. Supports... | | Experimental |
| 154 | vishvaRam/Docker-vLLM-Server-Builder-Runpod | Production-grade, OpenAI-compatible server using vLLM v0.17.0. Deploy LLMs,... | | Experimental |
| 155 | karun2328/llm_serving_benchmarks | Benchmarking LLM inference serving with vLLM, analyzing latency, throughput,... | | Experimental |
| 156 | virtualramblas/FlexLLMGenMPS | Running large language models on a single M1/M2 GPU for throughput-oriented... | | Experimental |
| 157 | joeddav/illustrated-training-cluster | [WIP] Interactive visualization of LLM training parallelism across GPU clusters | | Experimental |
| 158 | ZeeetOne/llm-inference-deployment | Practical example of deploying fine-tuned LLMs locally with FastAPI.... | | Experimental |
| 159 | G-B-KEVIN-ARJUN/runtime-inference | "Faster AI: Accelerating Qwen 2.5 from 7 t/s to 82 t/s on a single RTX 4060... | | Experimental |
| 160 | biraj21/llm-server-from-scratch | FastAPI server for locally serving Gemma 3 270M & OpenAI Whisper with... | | Experimental |
| 161 | adithya-s-k/LLM-InferenceNet | LLM InferenceNet is a C++ project designed to facilitate fast and efficient... | | Experimental |
| 162 | keisuke-okb/llm-tokenwise-inference | Token-wise, real-time display inference module for Llama2 and other LLMs. | | Experimental |
| 163 | dae9999nam/LLM_C | This repository is to optimize the throughput of Large Language Model... | | Experimental |
| 164 | hades255/benchmarking-llama_install-on-modular_max | Benchmark Inference Stack (vLLM vs Modular/MAX) | | Experimental |