gitctrlx/llama.cu
Llama from scratch in CUDA with Flash Attention.
This project helps developers understand and implement high-performance large language model (LLM) inference on NVIDIA GPUs. It loads LLaMA model weights and runs them directly on the GPU to generate text. It is aimed at experienced software engineers and GPU programmers who want to learn the low-level mechanics of LLMs and CUDA.
Use this if you are a C++/CUDA developer keen to learn how to implement LLM inference directly on a GPU for maximum performance and educational insight.
Not ideal if you're a data scientist or machine learning practitioner looking for a high-level API to run LLMs without diving into GPU programming.
Stars
43
Forks
3
Language
Cuda
License
MIT
Category
Last pushed
Oct 22, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/gitctrlx/llama.cu"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
ModelCloud/GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
intel/auto-round
🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao
PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader
Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress
LLM KV cache compression made easy