zhihu/ZhiLight
A highly optimized LLM inference acceleration engine for Llama and its variants.
ZhiLight is a specialized engine that speeds up text generation from large language models (LLMs) such as Llama and its variants. By optimizing how a trained model runs on NVIDIA GPUs, it delivers lower response latency and higher throughput. This tool is aimed at AI engineers and machine learning operations specialists who deploy and manage LLMs in production.
905 stars. Actively maintained with 4 commits in the last 30 days.
Use this if you need to accelerate inference for Llama-based language models, especially on PCIe-based NVIDIA GPUs, so you can serve more user requests or reduce response times.
Not ideal if your LLM infrastructure does not primarily use NVIDIA GPUs or if you are not deploying Llama or similar models.
Stars: 905
Forks: 102
Language: C++
License: Apache-2.0
Category:
Last pushed: Mar 11, 2026
Commits (30d): 4
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/zhihu/ZhiLight"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related models
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
sgl-project/sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN
MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...