andrewkchan/yalm

Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O

Quality score: 37/100 (Emerging)

This is a C++/CUDA implementation for running large language models (LLMs) like Mistral on your own computer. It takes a pre-trained LLM's weights and configuration, converts them, and then lets you input prompts to receive generated text completions. It's designed for developers, researchers, or students who want to understand the underlying mechanics of LLM inference from scratch, rather than relying on existing libraries.

557 stars. No commits in the last 6 months.

Use this if you are a developer or researcher interested in understanding and experimenting with the low-level performance engineering of LLM inference on NVIDIA GPUs.

Not ideal if you need a production-ready system, a chat interface, or support for multiple GPUs or models beyond Mistral, Mixtral, and Llama-3.2.

AI-inference GPU-programming LLM-development performance-engineering CUDA-programming
No license · Stale (6 months) · No package · No dependents
Maintenance 2 / 25
Adoption 10 / 25
Maturity 8 / 25
Community 17 / 25


Stars: 557
Forks: 56
Language: C++
License: None
Last pushed: Sep 13, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/andrewkchan/yalm"

Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
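If you consume the API programmatically, you would parse its JSON response. The exact response schema is an assumption here; the field names below (`score`, `tier`, `breakdown`) are hypothetical, inferred from the values shown on this page.

```python
import json

# Hypothetical response body; the real API's field names may differ.
sample = json.loads("""
{
  "score": 37,
  "tier": "Emerging",
  "breakdown": {"maintenance": 2, "adoption": 10, "maturity": 8, "community": 17}
}
""")

# The four sub-scores (out of 25 each) sum to the overall score out of 100.
total = sum(sample["breakdown"].values())
print(sample["tier"], sample["score"], total)  # Emerging 37 37
```

In practice you would fetch the JSON with `curl` as shown above (or any HTTP client) and feed the body to `json.loads`.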