izmttk/ullm
Lightweight LLM inference engine inspired by nano-vllm, with a radix-tree-based prefix cache, tensor and pipeline parallelism (tp & pp), CUDA graphs, an OpenAI-compatible API, async scheduling, and more.
This project offers a high-performance engine for serving large language models (LLMs) like Qwen3-0.6B. It takes text prompts and generates creative or informative text completions, similar to how ChatGPT works. This tool is for developers and MLOps engineers who are building applications that use LLMs and need to serve them efficiently to many users.
Use this if you are a developer or MLOps engineer deploying LLMs and need a fast, scalable solution to handle multiple user requests with an OpenAI-compatible API.
Not ideal if you are a general user looking for a pre-built chatbot or a simple script to run LLMs on your personal computer without needing to optimize for throughput.
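The radix-tree prefix cache mentioned above is the key throughput trick: prompts that share a common prefix (e.g. the same system prompt) reuse the cached KV blocks for that prefix, so only the differing suffix is recomputed. A minimal sketch of the idea, not this repo's actual implementation (it uses a plain trie for clarity; a true radix tree compresses single-child chains into token runs, and the `kv_handle` field is a hypothetical stand-in for a KV-block reference):

```python
class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.kv_handle = None  # hypothetical reference to cached KV blocks

class PrefixCache:
    """Toy prefix cache over token-ID sequences."""

    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens, kv_handle):
        # Record that KV cache exists for this exact token sequence.
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
        node.kv_handle = kv_handle

    def longest_cached_prefix(self, tokens):
        # Walk down the tree; the number of matched tokens is how much
        # of the prompt's KV cache can be reused instead of recomputed.
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n


cache = PrefixCache()
cache.insert([1, 2, 3, 4], kv_handle="blk0")
print(cache.longest_cached_prefix([1, 2, 3, 9]))  # 3 tokens reusable
```

In a real engine the per-node payload would point at paged KV-cache blocks, and eviction (e.g. LRU on leaf nodes) would bound memory; this sketch only shows the lookup structure.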
Stars: 9
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: Mar 10, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/izmttk/ullm"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
Goekdeniz-Guelmez/mlx-lm-lora
Train Large Language Models on MLX.
uber-research/PPLM
Plug and Play Language Model implementation. Allows steering the topic and attributes of GPT-2 models.
VHellendoorn/Code-LMs
Guide to using pre-trained large language models of source code
ssbuild/chatglm_finetuning
ChatGLM-6B fine-tuning and Alpaca fine-tuning.
jarobyte91/pytorch_beam_search
A lightweight implementation of Beam Search for sequence models in PyTorch.