izmttk/ullm

Lightweight LLM inference engine inspired by nano-vllm, with a radix-tree-based prefix cache, tensor and pipeline parallelism (TP & PP), CUDA graphs, an OpenAI-compatible API, async scheduling, and more.
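The prefix cache mentioned above reuses work across prompts that share a common beginning. A minimal sketch of the idea in Python, simplified to a plain trie over token IDs (a real radix tree compresses single-child chains and maps matched prefixes to KV-cache blocks; the class and method names here are illustrative, not from the project):

```python
class PrefixCacheNode:
    """One trie node; children are keyed by token ID."""
    def __init__(self):
        self.children = {}

class PrefixCache:
    """Toy prefix cache: records token sequences and finds shared prefixes."""
    def __init__(self):
        self.root = PrefixCacheNode()

    def insert(self, tokens):
        """Record a prompt's token sequence for later reuse."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixCacheNode())

    def longest_prefix(self, tokens):
        """Return how many leading tokens match a previously seen prompt."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4])
print(cache.longest_prefix([1, 2, 3, 9]))  # prints 3: three leading tokens reusable
```

In a serving engine, the matched prefix lets the scheduler skip recomputing attention keys/values for those tokens.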

Quality score: 30 / 100 (Emerging)

This project offers a high-performance engine for serving large language models (LLMs) like Qwen3-0.6B. It takes text prompts and generates creative or informative text completions, similar to how ChatGPT works. This tool is for developers and MLOps engineers who are building applications that use LLMs and need to serve them efficiently to many users.

Use this if you are a developer or MLOps engineer deploying LLMs and need a fast, scalable solution to handle multiple user requests with an OpenAI-compatible API.

Not ideal if you are a general user looking for a pre-built chatbot or a simple script to run LLMs on your personal computer without needing to optimize for throughput.
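Because the engine exposes an OpenAI-compatible API, clients can talk to it with standard chat-completion requests. A hedged sketch using only the Python standard library; the base URL, port, and model name are assumptions for illustration, not taken from the project's docs:

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style /v1/chat/completions request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Assumed local deployment; sending needs a running server, so it is left commented:
req = build_chat_request("http://localhost:8000", "Qwen3-0.6B", "Hello!")
# resp = urllib.request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library should work the same way, pointed at the server's base URL.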

LLM-deployment MLOps API-development AI-application-serving inference-optimization
No package · No dependents

Maintenance: 10 / 25
Adoption: 5 / 25
Maturity: 15 / 25
Community: 0 / 25


Stars: 9
Forks:
Language: Python
License: MIT
Last pushed: Mar 10, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/izmttk/ullm"

Open to everyone: 100 requests/day with no key needed; a free key raises the limit to 1,000/day.