gitctrlx/llama.cu

Llama from scratch in CUDA with Flash Attention.

Score: 34 / 100 (Emerging)

This project helps developers deeply understand and implement high-performance large language model (LLM) inference on NVIDIA GPUs. It loads LLaMA model weights and runs them directly on the GPU, producing generated text. It is designed for experienced software engineers and GPU programmers who want to learn the low-level mechanics of LLMs and CUDA.

Use this if you are a C++/CUDA developer keen to learn how to implement LLM inference directly on a GPU for maximum performance and educational insight.

Not ideal if you're a data scientist or machine learning practitioner looking for a high-level API to run LLMs without diving into GPU programming.
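The Flash Attention mentioned in the tagline hinges on an online (streaming) softmax, which lets attention be computed tile by tile in one pass without materializing the full score matrix. A minimal Python sketch of that core recurrence for a single query position, written for illustration and not taken from this repo:

```python
import math

def online_softmax_weighted_sum(scores, values):
    """Stream over (score, value) pairs while maintaining a running max,
    a running normalizer, and a running weighted sum. This one-pass
    rescaling trick is the heart of Flash Attention's memory savings."""
    running_max = float("-inf")
    normalizer = 0.0
    acc = 0.0
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        scale = math.exp(running_max - new_max)  # rescale previous state
        normalizer = normalizer * scale + math.exp(s - new_max)
        acc = acc * scale + math.exp(s - new_max) * v
        running_max = new_max
    return acc / normalizer
```

The result matches a conventional two-pass softmax-weighted sum; in a real CUDA kernel the same recurrence runs over tiles of keys and values held in shared memory.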

Tags: GPU programming, CUDA development, high-performance computing, machine learning engineering, LLM inference
No package · No dependents
Maintenance 6 / 25
Adoption 8 / 25
Maturity 13 / 25
Community 7 / 25

How are scores calculated?
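The overall 34 / 100 appears to be the plain sum of the four 25-point categories listed above; this page does not state the formula explicitly, so simple addition is an assumption. A one-line sanity check:

```python
# Category scores from the breakdown above (each out of 25).
category_scores = {
    "Maintenance": 6,
    "Adoption": 8,
    "Maturity": 13,
    "Community": 7,
}

# Assumed: the overall score (out of 100) is the sum of the categories.
overall = sum(category_scores.values())
print(overall)  # 34
```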

Stars: 43
Forks: 3
Language: CUDA
License: MIT
Last pushed: Oct 22, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/gitctrlx/llama.cu"

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.