LLM Inference Engines (Transformer Models)

Optimized inference engines and serving systems for deploying and running large language models efficiently. Focuses on throughput, latency, memory optimization, and production deployment. Does NOT include training frameworks, fine-tuning methods, quantization techniques, or model architecture implementations.

There are 164 LLM inference engine projects tracked. Seven score 70 or above (Verified tier). The highest-rated is vllm-project/vllm at 87/100 with 73,007 stars. All of the top 10 are actively maintained.

Get the projects as JSON (the limit parameter caps results per request, so raise it to fetch all 164):

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-inference-engines&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.

Rank. Model (score/100, tier): description

1. vllm-project/vllm (87, Verified): A high-throughput and memory-efficient inference and serving engine for LLMs
2. sgl-project/sglang (87, Verified): SGLang is a high-performance serving framework for large language models and...
3. alibaba/MNN (80, Verified): MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba,...
4. xorbitsai/inference (76, Verified): Swap GPT for any LLM by changing a single line of code. Xinference lets you...
5. tensorzero/tensorzero (76, Verified): TensorZero is an open-source stack for industrial-grade LLM applications. It...
6. tenstorrent/tt-metal (73, Verified): TT-NN operator library, and TT-Metalium low-level kernel programming model.
7. alibaba/rtp-llm (70, Verified): RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
8. jd-opensource/xllm (69, Established): A high-performance inference engine for LLMs, optimized for diverse AI accelerators.
9. gpustack/gpustack (68, Established): Performance-optimized AI inference on your GPUs. Unlock superior throughput...
10. ARahim3/mlx-tune (68, Established): Bringing the Unsloth experience to Mac users via Apple's MLX framework
11. InternLM/lmdeploy (67, Established): LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
12. ModelTC/LightLLM (65, Established): LightLLM is a Python-based LLM (Large Language Model) inference and serving...
13. FastFlowLM/FastFlowLM (62, Established): Run LLMs on AMD Ryzen™ AI NPUs in minutes. Just like Ollama - but...
14. NexaAI/nexa-sdk (60, Established): Run frontier LLMs and VLMs with day-0 model support across GPU, NPU, and...
15. NVIDIA-NeMo/Automodel (59, Established): PyTorch Distributed native training library for LLMs/VLMs with OOTB Hugging...
16. zhihu/ZhiLight (59, Established): A highly optimized LLM inference acceleration engine for Llama and its variants.
17. underneathall/pinferencia (56, Established): Python + Inference - Model deployment library in Python. Simplest model...
18. ai-decentralized/BloomBee (55, Established): Decentralized LLM fine-tuning and inference with offloading
19. bigscience-workshop/petals (54, Established): 🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x...
20. toverainc/willow-inference-server (54, Established): Open source, local, and self-hosted highly optimized language inference...
21. Tiiny-AI/PowerInfer (54, Established): High-speed Large Language Model Serving for Local Deployment
22. GeeeekExplorer/nano-vllm (53, Established): Nano vLLM
23. livepeer/ai-runner (53, Established): Inference runtime for running different batch and real-time AI pipelines.
24. alibaba/InferSim (52, Established): A Lightweight LLM Inference Performance Simulator
25. microsoft/vidur (52, Established): A large-scale simulation framework for LLM inference
26. zhenye234/LLaSA_training (52, Established): LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
27. AI-Hypercomputer/JetStream (51, Established): JetStream is a throughput- and memory-optimized engine for LLM inference on...
28. vitoplantamura/OnnxStream (51, Established): Lightweight inference library for ONNX files, written in C++. It can run...
29. kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference (51, Established): Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A
30. microsoft/sarathi-serve (50, Established): A low-latency & high-throughput serving engine for LLMs
31. Troyanovsky/Local-LLM-Comparison-Colab-UI (50, Established): Compare the performance of different LLMs that can be deployed locally on...
32. jina-ai/rungpt (50, Established): An open-source, cloud-native serving framework for large multi-modal models (LMMs).
33. Deep-Spark/DeepSparkInference (49, Emerging): DeepSparkInference has selected 216 inference models of both small and large...
34. higgsfield-ai/higgsfield (49, Emerging): Fault-tolerant, highly scalable GPU orchestration, and a machine learning...
35. intel/ipex-llm (49, Emerging): Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM,...
36. slwang-ustc/nano-vllm-v1 (48, Emerging): Nano vLLM with vLLM v1's request scheduling strategy and chunked prefill
37. SearchSavior/OpenArc (48, Emerging): Inference engine for Intel devices. Serve LLMs, VLMs, Whisper, Kokoro-TTS,...
38. vectorch-ai/ScaleLLM (47, Emerging): A high-performance inference system for large language models, designed for...
39. bytedance/byteir (46, Emerging): A model compilation solution for various hardware
40. MegEngine/InferLLM (46, Emerging): A lightweight LLM inference framework
41. RWKV/rwkv.cpp (45, Emerging): INT4/INT5/INT8 and FP16 inference on CPU for the RWKV language model
42. inclusionAI/asystem-awex (45, Emerging): A high-performance RL training-inference weight synchronization framework,...
43. powerserve-project/PowerServe (44, Emerging): High-speed, easy-to-use LLM serving framework for local deployment
44. interestingLSY/swiftLLM (44, Emerging): A tiny yet powerful LLM inference system tailored for research purposes....
45. andrewkchan/deepseek.cpp (44, Emerging): CPU inference for the DeepSeek family of large language models in C++
46. SqueezeAILab/LLMCompiler (44, Emerging): [ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling
47. 1b5d/llm-api (44, Emerging): Run any Large Language Model behind a unified API
48. AI-Hypercomputer/jetstream-pytorch (44, Emerging): PyTorch/XLA integration with JetStream (https://github.com/google/JetStream)...
49. PureBee/purebee (43, Emerging): A GPU defined in software. Runs Llama 3.2 1B at 3.6 tok/sec. Zero dependencies.
50. modelscope/dash-infer (43, Emerging): DashInfer is a native LLM inference engine aiming to deliver...
51. jankais3r/LLaMA_MPS (42, Emerging): Run LLaMA (and Stanford-Alpaca) inference on Apple Silicon GPUs.
52. Azure99/BlossomData (42, Emerging): A fluent, scalable, and easy-to-use LLM data processing framework.
53. chenmozhijin/BSRoformer.cpp (40, Emerging): GGML-based C++ inference for BS Roformer/Mel-Band-Roformer vocal separation...
54. zejia-lin/BulletServe (40, Emerging): Boosting GPU utilization for LLM serving via dynamic spatial-temporal...
55. aniketmaurya/llm-inference (40, Emerging): Large Language Model (LLM) Inference API and Chatbot
56. James-QiuHaoran/LLM-serving-with-proxy-models (39, Emerging): Efficient Interactive LLM Serving with Proxy Model-based Sequence Length...
57. riccardomusmeci/mlx-llm (39, Emerging): Large Language Model (LLM) applications and tools running on Apple Silicon...
58. MrYxJ/calculate-flops.pytorch (39, Emerging): calflops is designed to calculate FLOPs, MACs, and parameters in all...
59. hpcaitech/SwiftInfer (39, Emerging): Efficient AI Inference & Serving
60. argonne-lcf/LLM-Inference-Bench (38, Emerging): LLM-Inference-Bench
61. toyaix/TritonLLM (38, Emerging): LLM Inference via Triton (Flexible & Modular): Focused on Kernel...
62. jdaln/dgx-spark-inference-stack (38, Emerging): Serve the home! Inference stack for your Nvidia DGX Spark aka the Grace...
63. AmpereComputingAI/llama.cpp (38, Emerging): Ampere-optimized llama.cpp
64. TrevTron/indiedroid-nova-llm (38, Emerging): Running Llama 3.1 8B and other LLMs on the RK3588 NPU - benchmarks and setup guides
65. efeslab/Nanoflow (38, Emerging): A throughput-oriented high-performance serving framework for LLMs
66. thruthseeker/LionLock_FDE_OSS (38, Emerging): Open-source fatigue detection engine for large language models with trust overlay
67. knagrecha/saturn (37, Emerging): Saturn accelerates the training of large-scale deep learning models with a...
68. zRzRzRzRzRzRzR/lm-fly (37, Emerging): An LLM inference acceleration framework: make LLMs fly
69. CoderLSF/fast-llama (37, Emerging): Runs LLaMA at extremely high speed
70. rbitr/llm.f90 (37, Emerging): LLM inference in Fortran
71. ShinoharaHare/LLM-Training (37, Emerging): A distributed training framework for large language models powered by Lightning.
72. invergent-ai/surogate (37, Emerging): Insanely fast LLM pre-training and fine-tuning for modern NVIDIA GPUs....
73. andrewkchan/yalm (37, Emerging): Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
74. gotzmann/booster (36, Emerging): Booster - an open accelerator for LLM models. Better inference and debugging...
75. m-horky/sllm (36, Emerging): Tools using small Large Language Models
76. m0dulo/InferSpore (36, Emerging): 🌱 A fully independent Large Language Model (LLM) inference engine, built...
77. moeru-ai/demodel (36, Emerging): 🚀🛸 Easily boost the speed of pulling your models and datasets from various...
78. alibaba/easydist (35, Emerging): Automated Parallelization System and Infrastructure for Multiple Ecosystems
79. lucasjinreal/Namo-R1 (35, Emerging): A CPU real-time VLM in 500M. Surpassed Moondream2 and SmolVLM. Training from...
80. nareshis21/Truelarge-RT (34, Emerging): Android inference engine running 20B+ parameter LLMs on 4GB-8GB RAM devices....
81. vivy-yi/awesome-llm-training-inference (34, Emerging): Curated list of LLM training and inference frameworks, tools, and resources....
82. RahulSChand/gpu_poor (34, Emerging): Calculate tokens/s & GPU memory requirements for any LLM. Supports...
83. yingding/applyllm (33, Emerging): A Python package for applying LLMs with LangChain and Hugging Face on local...
84. gunnarnordqvist/opencode-context-filter (33, Emerging): Transparent HTTP proxy that automatically filters repository context for...
85. AshishGautamX/K8s-LLM-Scheduler (33, Emerging): An intelligent Kubernetes scheduler powered by Meta's Llama-3.3-70B model...
86. psmarter/mini-infer (33, Emerging): A high-performance LLM inference engine with PagedAttention |...
87. winstxnhdw/llm-api (32, Emerging): A fast CPU-based API for Qwen 2.5 using CTranslate2, hosted on Hugging Face Spaces.
88. dengls24/LLM-para (32, Emerging): Analyze LLM inference: FLOPs, memory, Roofline model. Supports GQA, MoE,...
89. kennethleungty/DeepSeek-R1-Ollama-Simple-Evals (32, Emerging): Run and Evaluate DeepSeek-R1 Distilled Models Locally with Ollama and...
90. HyperMink/inferenceable (32, Emerging): Scalable AI Inference Server for CPU and GPU with Node.js | Utilizes...
91. tommasocerruti/detllm (32, Emerging): Deterministic-mode checks for LLM inference: measure run/batch variance,...
92. ybubnov/metalchat (32, Emerging): Pure C++23 Llama inference for Apple Silicon chips
93. Relaxed-System-Lab/HexGen (31, Emerging): [ICML 2024] Serving LLMs on heterogeneous decentralized clusters.
94. titanml/takeoff-community (31, Emerging): TitanML Takeoff Server is an optimization, compression, and deployment...
95. bpevangelista/vfastml (31, Emerging): Inference and Training Engine for LLMs, Image2Image, and Other Models
96. KevinLee1110/dynamic-batching (31, Emerging): The official repo for the paper "Optimizing LLM Inference Throughput via...
97. harleyszhang/llm_counts (31, Emerging): LLM theoretical performance analysis tools; supports params, FLOPs, memory...
98. ToddThomson/Mila (30, Emerging): The Achilles Mila Deep Neural Network library provides a comprehensive API to...
99. VPanjeta/PyLLaMa-CPU (29, Experimental): Fast LLaMA inference on CPU using llama.cpp for Python
100. BenChaliah/NVFP4-on-4090-vLLM (28, Experimental): AdaLLM is an NVFP4-first inference runtime for Ada Lovelace (RTX 4090) with...
101. changwoolee/BLAST (27, Experimental): [NeurIPS 2024] BLAST: Block Level Adaptive Structured Matrix for Efficient...
102. KarthikSriramGit/H.E.I.M.D.A.L.L (27, Experimental): H.E.I.M.D.A.L.L looks at fleet telemetry and gives you natural-language...
103. modelize-ai/LLM-Inference-Deployment-Tutorial (27, Experimental): Tutorial for LLM developers about engine design, service deployment,...
104. jmaczan/tiny-vllm (26, Experimental): High-performance LLM inference engine, a younger sibling of vLLM
105. datvodinh/serve-llm (25, Experimental): Serve high-throughput and scalable LLMs using Ray and vLLM
106. dwain-barnes/LLM-GGUF-Auto-Converter (25, Experimental): Automated Jupyter notebook solution for batch converting Large Language...
107. HelpingAI/inferno (25, Experimental): Run Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and other...
108. EmbeddedLLM/embeddedllm (25, Experimental): EmbeddedLLM: API server for Embedded Device Deployment. Currently supports...
109. nitrictech/pycasts (24, Experimental): A text-to-podcast inference API
110. tensorchord/inference-benchmark (24, Experimental): Benchmark for machine learning model online serving (LLM, embedding,...
111. ictnlp/SiLLM (23, Experimental): SiLLM is a Simultaneous Machine Translation (SiMT) Framework. It utilizes a...
112. mjglatzmaier/llm-boostrap (22, Experimental): Starter repo for running local LLM inference and lightweight benchmarking on...
113. adamydwang/mobilellama (22, Experimental): A lightweight C++ LLaMA inference engine for mobile devices
114. rafaelmaza/llmfit-web (22, Experimental): Find the best open-source LLM for your GPU/RAM - fit, speed & quality...
115. AMD-AGI/gpt-fast (22, Experimental): GPT-Fast for Multimodal Models on AMD GPUs
116. deepagency/llm-resource-planner (22, Experimental): A simple CLI tool to fetch Hugging Face model metadata and estimate required...
117. AntonioVFranco/elamonica (22, Experimental): Production-ready test-time compute optimization framework for LLM inference....
118. quantumnic/ssd-llm (22, Experimental): Run 70B+ LLMs on Apple Silicon by using SSD as extended memory — intelligent...
119. TeamADAPT/blitzkernels (21, Experimental): BlitzKernels — production WASM inference kernels for edge AI (embedding,...
120. llm-works/llm-infer (21, Experimental): LLM inference server with native, vLLM, and Ollama backends, including a...
121. iNeil77/vllm-code-harness (21, Experimental): Run code inference-only benchmarks quickly using vLLM
122. GPUforLLM/llm-vram-calculator (21, Experimental): Accurate VRAM calculator for local LLMs (Llama 4, DeepSeek V3, Qwen 2.5)....
123. NEBUL-AI/HF-VRAM-Extension (21, Experimental): VRAM calculator for Hugging Face models
124. CornelisKuijpers/SIP-interface (21, Experimental): Run 400B+ parameter AI models on consumer hardware with 12GB RAM
125. landry-some/LLM-streaming (21, Experimental): Efficient streaming inference for large language models (LLMs).
126. liam8421/faster-llm (21, Experimental): 🚀 Accelerate LLM training with Fast-LLM, an open-source library for...
127. onlychara553-debug/dgx-spark-inference-stack (21, Experimental): 🚀 Serve large language models efficiently at home with this Docker-based...
128. MonitooDev/indiedroid-nova-llm (21, Experimental): 🚀 Benchmark local LLMs like Llama 3.1 on the Indiedroid Nova with RK3588...
129. isshiki-dev/docker-model-runner (20, Experimental): Self-hosted Anthropic-API-compatible Inference Server with Claude Code...
130. X-rayLaser/DistributedLLM (20, Experimental): Run LLM inference by splitting models into parts and hosting each part on a...
131. arkodeepsen/helix (19, Experimental): Professional training stack for 100M-parameter language models optimized for...
132. getflexai/flex_ai (19, Experimental): Simplifies fine-tuning and inference for 60+ open-source LLMs through a single API
133. eniompw/llama-cpp-gpu (19, Experimental): Load larger models by offloading model layers to both GPU and CPU
134. ThalesMMS/sglang-config (19, Experimental): Configuration files and deployment scripts for serving Llama 3.2 3B and Qwen...
135. Artemarius/CuInfer (19, Experimental): From-scratch LLM inference engine in C++17/CUDA. Custom kernels, GGUF model...
136. johnbrodowski/AutoInferenceBenchmark (19, Experimental): AutoInferenceBenchmark is a Windows desktop application for evaluating and...
137. EvanZhuang/rocm_tips (18, Experimental): Tips for building and using DL packages for AMD ROCm
138. di-osc/osc-llm (17, Experimental): A lightweight LLM inference engine
139. Scieries-Reunies-de-l-Est/llm (17, Experimental): LLM deployment API of the Service Commercial company.
140. darxkies/cpu-slm (17, Experimental): A holiday project to better understand the inner workings of SLMs/LLMs.
141. virtualramblas/DFloat11_MPS (17, Experimental): DFloat11 for Apple Silicon.
142. KT313/assistant_base (17, Experimental): A custom framework for easy use of LLMs, VLMs, etc., supporting various modes...
143. Alexyskoutnev/TurboInference (17, Experimental): Welcome to TurboInference, a high-performance inference toolkit written in...
144. piotrmaciejbednarski/llm-inference-tampering (17, Experimental): Proof-of-concept for persistent manipulation of LLM outputs by modifying...
145. Meahg/exvllm (17, Experimental): 🚀 Enhance vLLM with exvllm to utilize MoE mixed inference, enabling...
146. nikelborm/amd-amdgpu-rocm-ollama-gfx90c-ati-radeon-vega-ryzen7-5800H-arch-linux (15, Experimental): Run Ollama on AMD Ryzen 7 5800H CPU with integrated GPU AMD ATI Radeon Vega...
147. SunayHegde2006/Air.rs (15, Experimental): Air.rs: 70B+ inference on consumer GPUs, LLM inference in Rust
148. 1337hero/rx7900xtx-llama-bench-rocm (15, Experimental): Benchmark script for llama.cpp & results for the AMD RX 7900 XTX
149. rajatady/Inference-Stack (14, Experimental): Production-grade LLM inference API built from scratch. NestJS gateway +...
150. soy-tuber/localllama-insights (14, Experimental): Technical insights from r/LocalLLaMA — vLLM, FP8, NVFP4, Blackwell GPU...
151. Pyrolignic-paydirt84/pse-vcipher-collapse (14, Experimental): Accelerate LLM inference by collapsing attention paths with...
152. rick97julho/do-i-have-the-vram (14, Experimental): 🔍 Estimate your VRAM needs for Hugging Face models in seconds without...
153. rinoScremin/Open_Cluster_AI_Station_beta (14, Experimental): High-performance distributed matrix computation for AI workloads. Supports...
154. vishvaRam/Docker-vLLM-Server-Builder-Runpod (13, Experimental): Production-grade, OpenAI-compatible server using vLLM v0.17.0. Deploy LLMs,...
155. karun2328/llm_serving_benchmarks (13, Experimental): Benchmarking LLM inference serving with vLLM, analyzing latency, throughput,...
156. virtualramblas/FlexLLMGenMPS (13, Experimental): Running large language models on a single M1/M2 GPU for throughput-oriented...
157. joeddav/illustrated-training-cluster (13, Experimental): [WIP] Interactive visualization of LLM training parallelism across GPU clusters
158. ZeeetOne/llm-inference-deployment (13, Experimental): Practical example of deploying fine-tuned LLMs locally with FastAPI....
159. G-B-KEVIN-ARJUN/runtime-inference (13, Experimental): "Faster AI: Accelerating Qwen 2.5 from 7 t/s to 82 t/s on a single RTX 4060...
160. biraj21/llm-server-from-scratch (13, Experimental): FastAPI server for locally serving Gemma 3 270M & OpenAI Whisper with...
161. adithya-s-k/LLM-InferenceNet (12, Experimental): LLM InferenceNet is a C++ project designed to facilitate fast and efficient...
162. keisuke-okb/llm-tokenwise-inference (11, Experimental): Token-wise, real-time display inference module for Llama 2 and other LLMs.
163. dae9999nam/LLM_C (10, Experimental): This repository optimizes the throughput of Large Language Model...
164. hades255/benchmarking-llama_install-on-modular_max (10, Experimental): Benchmark Inference Stack (vLLM vs Modular/MAX)
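The tier boundaries implied by the scores above appear to be Verified at 70+, Established at 50-69, Emerging at 30-49, and Experimental below 30. Once the JSON is downloaded, those buckets can be recomputed locally; this is a sketch that assumes each record carries name and score fields, with cutoffs read off the visible list rather than taken from any published scoring rules.

```python
from collections import Counter

def tier(score: int) -> str:
    """Map a 0-100 quality score to its tier.
    Cutoffs are inferred from the ranked list, not officially documented."""
    if score >= 70:
        return "Verified"
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"

def tier_counts(projects: list[dict]) -> Counter:
    """Bucket downloaded records by tier; assumes a 'score' field per record."""
    return Counter(tier(p["score"]) for p in projects)

# Hypothetical sample mirroring three entries from the table above.
sample = [
    {"name": "vllm-project/vllm", "score": 87},
    {"name": "jd-opensource/xllm", "score": 69},
    {"name": "hades255/benchmarking-llama_install-on-modular_max", "score": 10},
]
```

Running tier_counts over the full 164-record download should reproduce the headline numbers (for example, seven projects in the Verified bucket) if the inferred cutoffs are right.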