zhihu/ZhiLight
A highly optimized LLM inference acceleration engine for Llama and its variants.
ZhiLight is a specialized engine that speeds up text generation from large language models (LLMs) such as Llama and its variants. By optimizing how a trained model runs on NVIDIA GPUs, it delivers lower response latency and higher throughput. This tool is aimed at AI engineers and machine learning operations specialists who deploy and manage LLMs in production.
905 stars. Actively maintained with 4 commits in the last 30 days.
Use this if you need to accelerate inference for Llama-based language models, especially on PCIe-based NVIDIA GPUs, so you can serve more user requests or reduce response times.
Not ideal if your LLM infrastructure does not primarily use NVIDIA GPUs or if you are not deploying Llama or similar models.
Stars: 905
Forks: 102
Language: C++
License: Apache-2.0
Category:
Last pushed: Mar 11, 2026
Commits (30d): 4
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/zhihu/ZhiLight"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related models
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
sgl-project/sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN
MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...