zejia-lin/BulletServe
Boosting GPU utilization for LLM serving via dynamic spatial-temporal prefill & decode orchestration
BulletServe boosts GPU efficiency when serving large language models (LLMs). It schedules prompt processing (prefill) and response generation (decode) to run on the GPU at the same time, which reduces response latency and lets a single GPU handle more concurrent users. This makes it well suited to engineers and MLOps practitioners operating LLM inference services.
Use this if you are running LLMs in production and want to maximize throughput and minimize latency on your existing GPU infrastructure.
Not ideal if you are looking for a general-purpose LLM serving framework with broad feature parity, as this is a specialized research prototype focused on performance optimization.
Stars: 37
Forks: 5
Language: Python
License: Apache-2.0
Category:
Last pushed: Jan 08, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/zejia-lin/BulletServe"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
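If you prefer to call the endpoint from code rather than curl, a minimal Python sketch follows. It assumes only the URL pattern shown above; the response schema is not documented here, so the payload is returned as raw JSON.

```python
# Minimal sketch of querying the quality API from Python.
# Assumption: only the URL pattern from the curl example above is known;
# the JSON response schema is undocumented, so we decode it generically.
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON payload (keyless tier: 100 requests/day)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(quality_url("zejia-lin", "BulletServe"))
```

The URL built for this repository matches the curl example above; swap in any owner/repo pair to query a different project.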
Higher-rated alternatives
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
sgl-project/sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN
MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...