gitctrlx/llama.cu
Llama from scratch in CUDA with Flash Attention.
This project helps developers understand and implement high-performance large language model (LLM) inference on NVIDIA GPUs. It loads LLaMA model weights and runs them directly on the GPU to generate text. It is aimed at experienced software engineers and GPU programmers who want to learn the low-level mechanics of LLMs and CUDA.
Use this if you are a C++/CUDA developer keen to learn how to implement LLM inference directly on a GPU for maximum performance and educational insight.
Not ideal if you're a data scientist or machine learning practitioner looking for a high-level API to run LLMs without diving into GPU programming.
Stars
43
Forks
3
Language
Cuda
License
MIT
Category
Last pushed
Oct 22, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/gitctrlx/llama.cu"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
ModelCloud/GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
intel/auto-round
🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao
PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader
Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress
LLM KV cache compression made easy