andrewkchan/yalm
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
This is a C++/CUDA implementation for running large language models (LLMs) like Mistral on your own computer. It takes a pre-trained LLM's weights and configuration, converts them into its own format, and then generates text completions from your prompts. It's designed for developers, researchers, and students who want to understand the mechanics of LLM inference from scratch rather than relying on existing libraries.
557 stars. No commits in the last 6 months.
Use this if you are a developer or researcher interested in understanding and experimenting with the low-level performance engineering of LLM inference on NVIDIA GPUs.
Not ideal if you need a production-ready system, a chat interface, or support for multiple GPUs or models beyond Mistral, Mixtral, and Llama-3.2.
Stars: 557
Forks: 56
Language: C++
License: —
Category: —
Last pushed: Sep 13, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/andrewkchan/yalm"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
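If you would rather query the endpoint from code than from curl, the request can be sketched in Python. Note that only the URL comes from this page; the JSON response shape (field names like `stars`, `forks`, `commits_30d`) is an assumption for illustration, not documented here.

```python
import json
from urllib.parse import quote

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    # Builds the endpoint URL shown above; "transformers" appears to be
    # a category slug in the path.
    return f"{BASE}/{quote(category)}/{quote(owner)}/{quote(repo)}"

# Hypothetical response payload mirroring the fields displayed on this
# page; the real API's JSON keys may differ.
sample = json.loads('{"stars": 557, "forks": 56, "language": "C++", "commits_30d": 0}')

def summarize(data: dict) -> str:
    return f"{data['stars']} stars, {data['forks']} forks, {data['commits_30d']} commits in 30d"

print(quality_url("transformers", "andrewkchan", "yalm"))
print(summarize(sample))
```

To make a live request, pass the built URL to `urllib.request.urlopen` (or the `requests` library) and decode the body with `json.loads`.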
Higher-rated alternatives
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
sgl-project/sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN
MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...