izmttk/ullm
Lightweight LLM inference engine inspired by nano-vllm, with a radix-tree-based prefix cache, tensor and pipeline parallelism (tp & pp), CUDA graphs, an OpenAI-compatible API, async scheduling, and more.
This project offers a high-performance engine for serving large language models (LLMs) like Qwen3-0.6B. It takes text prompts and generates creative or informative text completions, similar to how ChatGPT works. This tool is for developers and MLOps engineers who are building applications that use LLMs and need to serve them efficiently to many users.
Use this if you are a developer or MLOps engineer deploying LLMs and need a fast, scalable solution to handle multiple user requests with an OpenAI-compatible API.
Not ideal if you are a general user looking for a pre-built chatbot or a simple script to run LLMs on your personal computer without needing to optimize for throughput.
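The radix-tree prefix cache mentioned above is the key throughput trick: prompts that share a common prefix (e.g. the same system prompt) reuse the cached KV blocks for that prefix, so only the differing suffix is recomputed. A minimal sketch of the idea, not this repo's actual implementation (it uses a plain trie for clarity; a true radix tree compresses single-child chains into token runs, and the `kv_handle` field is a hypothetical stand-in for a KV-block reference):

```python
class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.kv_handle = None  # hypothetical reference to cached KV blocks

class PrefixCache:
    """Toy prefix cache over token-ID sequences."""

    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens, kv_handle):
        # Record that KV cache exists for this exact token sequence.
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
        node.kv_handle = kv_handle

    def longest_cached_prefix(self, tokens):
        # Walk down the tree; the number of matched tokens is how much
        # of the prompt's KV cache can be reused instead of recomputed.
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n


cache = PrefixCache()
cache.insert([1, 2, 3, 4], kv_handle="blk0")
print(cache.longest_cached_prefix([1, 2, 3, 9]))  # 3 tokens reusable
```

In a real engine the per-node payload would point at paged KV-cache blocks, and eviction (e.g. LRU on leaf nodes) would bound memory; this sketch only shows the lookup structure.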
Stars: 9
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: Mar 10, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/izmttk/ullm"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
Goekdeniz-Guelmez/mlx-lm-lora
Train Large Language Models on MLX.
uber-research/PPLM
Plug and Play Language Model implementation. Allows steering the topic and attributes of GPT-2 models.
VHellendoorn/Code-LMs
Guide to using pre-trained large language models of source code
ssbuild/chatglm_finetuning
ChatGLM-6B fine-tuning and Alpaca fine-tuning.
jarobyte91/pytorch_beam_search
A lightweight implementation of Beam Search for sequence models in PyTorch.