LLM CUDA Optimization Tools

Low-level CUDA kernel development, GPU memory optimization, and hardware-accelerated inference engines for LLMs. Includes custom GEMM implementations, tensor operations, quantization kernels, and distributed inference backends. Does NOT include high-level inference frameworks, application layers, or non-GPU acceleration methods.

There are 19 LLM CUDA optimization tools tracked. Two score above 50 (established tier). The highest-rated is ggml-org/ggml at 68/100 with 14,217 stars. Only 1 of the top 10 is actively maintained.

Get all 19 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-cuda-optimization&limit=20"
```

Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000/day.
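The endpoint above can also be queried programmatically. The sketch below builds the same request URL with the standard library and filters the results down to the established tier (score above 50). The response schema is an assumption — the `tools` key and the `score` field are guesses at the JSON shape, not documented fields — so adjust them to match what the API actually returns.

```python
import json
from urllib import parse, request

# Base endpoint taken from the curl example above.
API = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the query URL for the quality-dataset endpoint."""
    query = parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{API}?{query}"


def fetch_tools(url: str) -> list[dict]:
    """Fetch the dataset; no API key is needed up to 100 requests/day.

    NOTE: the top-level "tools" key is an assumption about the response.
    """
    with request.urlopen(url) as resp:
        return json.load(resp).get("tools", [])


def established(tools: list[dict]) -> list[dict]:
    """Keep only entries scoring above 50 (the 'established' tier)."""
    return [t for t in tools if t.get("score", 0) > 50]


url = build_url("llm-tools", "llm-cuda-optimization")
```

Calling `established(fetch_tools(url))` should then return the two top-tier entries listed below, assuming the response shape holds.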

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | ggml-org/ggml | Tensor library for machine learning | 68 | Established |
| 2 | onnx/ir-py | Efficient in-memory representation for ONNX, in Python | 52 | Established |
| 3 | SandAI-org/MagiCompiler | A plug-and-play compiler that delivers free-lunch optimizations for both... | 49 | Emerging |
| 4 | R-D-BioTech-Alaska/Qelm | Qelm - Quantum Enhanced Language Model | 49 | Emerging |
| 5 | bytedance/lightseq | LightSeq: A High Performance Library for Sequence Processing and Generation | 46 | Emerging |
| 6 | dongchany/ember | A lightweight multi-GPU inference engine for LLMs on mid/low-end GPUs. | 37 | Emerging |
| 7 | kekzl/imp | High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell GPUs... | 37 | Emerging |
| 8 | rockyco/OpenWLAN | AI-powered MATLAB-to-HLS framework for WLAN 802.11 synchronization. 3.88x... | 30 | Emerging |
| 9 | jjang-ai/jangq | JANG — GGUF for MLX. YOU MUST USE JANG_Q RUNTIME. Adaptive Mixed-Precision... | 30 | Emerging |
| 10 | rockyco/peakPicker | A Comprehensive Comparative Study of LLM-Aided FPGA Design Flow | 29 | Experimental |
| 11 | artalis-io/bitnet.c | Minimal, embeddable LLM inference engine in pure C11. 20+ GGUF quant... | 28 | Experimental |
| 12 | rockyco/ImageProcessing | LLM-Aided FPGA Design Optimization | 25 | Experimental |
| 13 | saleembarakat4/viva_tensor | 🚀 Accelerate your computations with viva_tensor, the fastest tensor library... | 22 | Experimental |
| 14 | mtmatheuus/QKV-Core | 🚀 Run modern 7B LLMs on legacy 4GB GPUs without crashes, breaking the VRAM... | 22 | Experimental |
| 15 | deependujha/DeepTensor | DeepTensor: A minimal PyTorch-like deep learning library focused on custom... | 19 | Experimental |
| 16 | 0xnu/qrme | qrme is a quantum-resistant encrypted machine learning system designed to... | 19 | Experimental |
| 17 | LessUp/tiny-llm | Lightweight LLM Inference Engine (CUDA C++17): W8A16 Quantization, KV Cache... | 19 | Experimental |
| 18 | ProCoder1199X/NanoAccel | Python Library for inference of LLMs on low end hardware and CPU optimizations | 17 | Experimental |
| 19 | K-Wu/intrasm_engine | Enhancing CUDA Intra-Streaming-Multiprocessor Parallelism for Large Language... | 13 | Experimental |