LLM CUDA Optimization Tools
Low-level CUDA kernel development, GPU memory optimization, and hardware-accelerated inference engines for LLMs. Includes custom GEMM implementations, tensor operations, quantization kernels, and distributed inference backends. Does NOT include high-level inference frameworks, application layers, or non-GPU acceleration methods.
There are 19 LLM CUDA optimization tools tracked. 2 score above 50 (established tier). The highest-rated is ggml-org/ggml at 68/100 with 14,217 stars. Only 1 of the top 10 is actively maintained.
Get all 19 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-cuda-optimization&limit=20"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
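For programmatic use, the same query can be issued from Python. A minimal sketch follows; the response schema (a `projects` list with `name` and `score` fields) is an assumption for illustration, not documented API behavior — check the actual JSON returned by the endpoint.

```python
from urllib.parse import urlencode

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the dataset query URL shown in the curl example above."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE}?{query}"

def established(projects: list[dict]) -> list[dict]:
    """Keep entries scoring above 50, the 'established' tier cutoff from the summary."""
    return [p for p in projects if p.get("score", 0) > 50]

# Hypothetical payload shape; real field names may differ.
sample = {"projects": [
    {"name": "ggml-org/ggml", "score": 68},
    {"name": "SandAI-org/MagiCompiler", "score": 40},
]}

print(build_url("llm-tools", "llm-cuda-optimization"))
print([p["name"] for p in established(sample["projects"])])
```

Fetching the URL with `urllib.request.urlopen` or `requests.get` and passing the decoded JSON through `established()` would reproduce the tier split described above, assuming the payload shape holds.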
| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | ggml-org/ggml | Tensor library for machine learning | 68 | Established |
| 2 | onnx/ir-py | Efficient in-memory representation for ONNX, in Python | | Established |
| 3 | SandAI-org/MagiCompiler | A plug-and-play compiler that delivers free-lunch optimizations for both... | | Emerging |
| 4 | R-D-BioTech-Alaska/Qelm | Qelm - Quantum Enhanced Language Model | | Emerging |
| 5 | bytedance/lightseq | LightSeq: A High Performance Library for Sequence Processing and Generation | | Emerging |
| 6 | dongchany/ember | A lightweight multi-GPU inference engine for LLMs on mid/low-end GPUs. | | Emerging |
| 7 | kekzl/imp | High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell GPUs... | | Emerging |
| 8 | rockyco/OpenWLAN | AI-powered MATLAB-to-HLS framework for WLAN 802.11 synchronization. 3.88x... | | Emerging |
| 9 | jjang-ai/jangq | JANG — GGUF for MLX. YOU MUST USE JANG_Q RUNTIME. Adaptive Mixed-Precision... | | Emerging |
| 10 | rockyco/peakPicker | A Comprehensive Comparative Study of LLM-Aided FPGA Design Flow | | Experimental |
| 11 | artalis-io/bitnet.c | Minimal, embeddable LLM inference engine in pure C11. 20+ GGUF quant... | | Experimental |
| 12 | rockyco/ImageProcessing | LLM-Aided FPGA Design Optimization | | Experimental |
| 13 | saleembarakat4/viva_tensor | 🚀 Accelerate your computations with viva_tensor, the fastest tensor library... | | Experimental |
| 14 | mtmatheuus/QKV-Core | 🚀 Run modern 7B LLMs on legacy 4GB GPUs without crashes, breaking the VRAM... | | Experimental |
| 15 | deependujha/DeepTensor | DeepTensor: A minimal PyTorch-like deep learning library focused on custom... | | Experimental |
| 16 | 0xnu/qrme | qrme is a quantum-resistant encrypted machine learning system designed to... | | Experimental |
| 17 | LessUp/tiny-llm | Lightweight LLM Inference Engine (CUDA C++17): W8A16 Quantization, KV Cache... | | Experimental |
| 18 | ProCoder1199X/NanoAccel | Python Library for inference of LLMs on low end hardware and CPU optimizations | | Experimental |
| 19 | K-Wu/intrasm_engine | Enhancing CUDA Intra-Streaming-Multiprocessor Parallelism for Large Language... | | Experimental |