LLM CUDA Optimization Tools

Low-level CUDA kernel development, GPU memory optimization, and hardware-accelerated inference engines for LLMs. Includes custom GEMM implementations, tensor operations, quantization kernels, and distributed inference backends. Does NOT include high-level inference frameworks, application layers, or non-GPU acceleration methods.

There are 19 LLM CUDA optimization tools tracked. Two score above 50 (established tier). The highest-rated is ggml-org/ggml at 68/100 with 14,217 stars. Only 1 of the top 10 is actively maintained.

Get all 19 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-cuda-optimization&limit=20"
```

Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000/day.
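The endpoint above can also be queried programmatically. The sketch below builds the same request URL with the standard library and filters the results down to the established tier (score above 50). The response schema is an assumption — the `tools` key and the `score` field are guesses at the JSON shape, not documented fields — so adjust them to match what the API actually returns.

```python
import json
from urllib import parse, request

# Base endpoint taken from the curl example above.
API = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the query URL for the quality-dataset endpoint."""
    query = parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{API}?{query}"


def fetch_tools(url: str) -> list[dict]:
    """Fetch the dataset; no API key is needed up to 100 requests/day.

    NOTE: the top-level "tools" key is an assumption about the response.
    """
    with request.urlopen(url) as resp:
        return json.load(resp).get("tools", [])


def established(tools: list[dict]) -> list[dict]:
    """Keep only entries scoring above 50 (the 'established' tier)."""
    return [t for t in tools if t.get("score", 0) > 50]


url = build_url("llm-tools", "llm-cuda-optimization")
```

Calling `established(fetch_tools(url))` should then return the two top-tier entries listed below, assuming the response shape holds.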

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | ggml-org/ggml | Tensor library for machine learning | 68 | Established |
| 2 | onnx/ir-py | Efficient in-memory representation for ONNX, in Python | 52 | Established |
| 3 | SandAI-org/MagiCompiler | A plug-and-play compiler that delivers free-lunch optimizations for both... | 49 | Emerging |
| 4 | R-D-BioTech-Alaska/Qelm | Qelm - Quantum Enhanced Language Model | 49 | Emerging |
| 5 | bytedance/lightseq | LightSeq: A High Performance Library for Sequence Processing and Generation | 46 | Emerging |
| 6 | dongchany/ember | A lightweight multi-GPU inference engine for LLMs on mid/low-end GPUs. | 37 | Emerging |
| 7 | kekzl/imp | High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell GPUs... | 37 | Emerging |
| 8 | rockyco/OpenWLAN | AI-powered MATLAB-to-HLS framework for WLAN 802.11 synchronization. 3.88x... | 30 | Emerging |
| 9 | jjang-ai/jangq | JANG — GGUF for MLX. YOU MUST USE JANG_Q RUNTIME. Adaptive Mixed-Precision... | 30 | Emerging |
| 10 | rockyco/peakPicker | A Comprehensive Comparative Study of LLM-Aided FPGA Design Flow | 29 | Experimental |
| 11 | artalis-io/bitnet.c | Minimal, embeddable LLM inference engine in pure C11. 20+ GGUF quant... | 28 | Experimental |
| 12 | rockyco/ImageProcessing | LLM-Aided FPGA Design Optimization | 25 | Experimental |
| 13 | saleembarakat4/viva_tensor | 🚀 Accelerate your computations with viva_tensor, the fastest tensor library... | 22 | Experimental |
| 14 | mtmatheuus/QKV-Core | 🚀 Run modern 7B LLMs on legacy 4GB GPUs without crashes, breaking the VRAM... | 22 | Experimental |
| 15 | deependujha/DeepTensor | DeepTensor: A minimal PyTorch-like deep learning library focused on custom... | 19 | Experimental |
| 16 | 0xnu/qrme | qrme is a quantum-resistant encrypted machine learning system designed to... | 19 | Experimental |
| 17 | LessUp/tiny-llm | Lightweight LLM Inference Engine (CUDA C++17): W8A16 Quantization, KV Cache... | 19 | Experimental |
| 18 | ProCoder1199X/NanoAccel | Python Library for inference of LLMs on low end hardware and CPU optimizations | 17 | Experimental |
| 19 | K-Wu/intrasm_engine | Enhancing CUDA Intra-Streaming-Multiprocessor Parallelism for Large Language... | 13 | Experimental |