andrewkchan/yalm
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
This is a C++/CUDA implementation for running large language models (LLMs) like Mistral on your own computer. It takes a pre-trained LLM's weights and configuration, converts them into its own format, and then generates text completions from your prompts. It's designed for developers, researchers, and students who want to understand the mechanics of LLM inference from scratch rather than relying on existing libraries.
557 stars. No commits in the last 6 months.
Use this if you are a developer or researcher interested in understanding and experimenting with the low-level performance engineering of LLM inference on NVIDIA GPUs.
Not ideal if you need a production-ready system, a chat interface, or support for multiple GPUs or models beyond Mistral, Mixtral, and Llama-3.2.
Stars: 557
Forks: 56
Language: C++
License: —
Category: —
Last pushed: Sep 13, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/andrewkchan/yalm"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
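If you would rather query the endpoint from code than from curl, the request can be sketched in Python. Note that only the URL comes from this page; the JSON response shape (field names like `stars`, `forks`, `commits_30d`) is an assumption for illustration, not documented here.

```python
import json
from urllib.parse import quote

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    # Builds the endpoint URL shown above; "transformers" appears to be
    # a category slug in the path.
    return f"{BASE}/{quote(category)}/{quote(owner)}/{quote(repo)}"

# Hypothetical response payload mirroring the fields displayed on this
# page; the real API's JSON keys may differ.
sample = json.loads('{"stars": 557, "forks": 56, "language": "C++", "commits_30d": 0}')

def summarize(data: dict) -> str:
    return f"{data['stars']} stars, {data['forks']} forks, {data['commits_30d']} commits in 30d"

print(quality_url("transformers", "andrewkchan", "yalm"))
print(summarize(sample))
```

To make a live request, pass the built URL to `urllib.request.urlopen` (or the `requests` library) and decode the body with `json.loads`.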
Higher-rated alternatives
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
sgl-project/sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN
MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...