GPU Parallel Programming ML Frameworks

Tutorials, guides, and implementations for GPU computing using CUDA and related parallel processing frameworks. Focuses on learning CUDA fundamentals, optimization techniques, and GPU-accelerated computing. Does NOT include ML applications built with GPUs, collective communication libraries, or physics simulations—only the programming language/platform itself.

61 GPU parallel programming frameworks are tracked. 3 score above 70 (Verified tier). The highest-rated is iree-org/iree at 73/100 with 3,655 stars. 6 of the top 10 are actively maintained.

Get all 61 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=gpu-parallel-programming&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
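The same request can be made from a script. The following is a minimal Python sketch using only the standard library. The base URL and the query parameters (domain, subcategory, limit) are copied from the curl command above; the helper names build_url and fetch_projects are hypothetical, and the shape of the JSON response is not documented here, so the parsed value is returned untouched.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Base endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="ml-frameworks",
              subcategory="gpu-parallel-programming",
              limit=20):
    """Build the request URL with the query parameters seen in the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Perform the GET request and parse the body as JSON.

    Assumption: the response body is valid JSON. Its structure is not
    documented on this page, so no fields are accessed here.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_projects(build_url()) issues one request, which counts against the daily quota mentioned above. Inspect the returned value before relying on any particular field names.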

| # | Framework | Description | Score | Tier |
|---|-----------|-------------|-------|------|
| 1 | iree-org/iree | A retargetable MLIR-based machine learning compiler and runtime toolkit. | 73 | Verified |
| 2 | brucefan1983/GPUMD | Graphics Processing Units Molecular Dynamics | 73 | Verified |
| 3 | uxlfoundation/oneDAL | oneAPI Data Analytics Library (oneDAL) | 71 | Verified |
| 4 | rapidsai/cuml | cuML - RAPIDS Machine Learning Library | 69 | Established |
| 5 | NVIDIA/cutlass | CUDA Templates and Python DSLs for High-Performance Linear Algebra | 67 | Established |
| 6 | ROCm/Tensile | [DEPRECATED] Moved to ROCm/rocm-libraries repo | 64 | Established |
| 7 | NVIDIA/nccl | Optimized primitives for collective multi-GPU communication | 64 | Established |
| 8 | openucx/ucc | Unified Collective Communication Library | 61 | Established |
| 9 | libxsmm/libxsmm | Library for specialized dense and sparse matrix operations, and deep... | 61 | Established |
| 10 | ROCm/hipBLASLt | [DEPRECATED] Moved to ROCm/rocm-libraries repo | 60 | Established |
| 11 | uxlfoundation/oneCCL | oneAPI Collective Communications Library (oneCCL) | 60 | Established |
| 12 | PaddleJitLab/CUDATutorial | A self-learning tutorial for CUDA High Performance Programming. | 55 | Established |
| 13 | google/gematria | Machine learning for machine code. | 53 | Established |
| 14 | mratsim/Arraymancer | A fast, ergonomic and portable tensor library in Nim with a deep learning... | 50 | Established |
| 15 | XiaoMi/mace | MACE is a deep learning inference framework optimized for mobile... | 49 | Emerging |
| 16 | NVIDIA/GMAT | A toolkit showing GPU's all-round capability in video processing | 47 | Emerging |
| 17 | cuMF/cumf_als | CUDA Matrix Factorization Library with Alternating Least Square (ALS) | 47 | Emerging |
| 18 | gorgonia/tensor | package tensor provides efficient and generic n-dimensional arrays in Go... | 46 | Emerging |
| 19 | srush/GPU-Puzzles | Solve puzzles. Learn CUDA. | 45 | Emerging |
| 20 | mc2-project/mc2 | A Platform for Secure Analytics and Machine Learning | 44 | Emerging |
| 21 | MegEngine/MegCC | MegCC is a deep learning model compiler with an ultra-lightweight runtime that is efficient and easy to port | 44 | Emerging |
| 22 | Edgecortix-Inc/mera | A Heterogeneous Platform Deep Learning Compiler Framework from EdgeCortix | 44 | Emerging |
| 23 | hshatti/Tensorium | A platform agnostic fast tensor manipulation library using SIMD when... | 43 | Emerging |
| 24 | OAID/TensorFlow-HRT | Heterogeneous Run Time version of TensorFlow. Added heterogeneous... | 42 | Emerging |
| 25 | bytedance/matxscript | A high-performance, extensible Python AOT compiler. | 42 | Emerging |
| 26 | MetaMachines/mm-ptx | PTX Inject and Stack PTX | 40 | Emerging |
| 27 | OutofAi/cudacanvas | Python Module for PyTorch Tensor Visualisation in CUDA Eliminating CPU Transfer | 40 | Emerging |
| 28 | google/nccl-fastsocket | NCCL Fast Socket is a transport layer plugin to improve NCCL collective... | 40 | Emerging |
| 29 | eedalong/ECE408 | Code base and slides for ECE408: Applied Parallel Programming On GPU. | 39 | Emerging |
| 30 | Frikallo/axiom | High-performance C++ tensor library with NumPy/PyTorch-like API | 38 | Emerging |
| 31 | HenryNdubuaku/cuda-tutorials | Comprehensive CUDA tutorials for Maths & ML with examples. | 37 | Emerging |
| 32 | mratsim/laser | The HPC toolbox: fused matrix multiplication, convolution, data-parallel... | 37 | Emerging |
| 33 | wangsiping97/FastGEMV | High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline. | 36 | Emerging |
| 34 | AXERA-TECH/ax-npu-kit-650 | AI algorithm SDK based on AX650 | 34 | Emerging |
| 35 | AndreSlavescu/meTile | python-based eDSL for efficient Metal Shading Language code generation | 34 | Emerging |
| 36 | openmlsys/openmlsys-cuda | Tutorials for writing high-performance GPU operators in AI frameworks. | 32 | Emerging |
| 37 | lawmurray/gpu-course | Deep neural network and Adam optimizer in straight C and CUDA. Accompanies... | 31 | Emerging |
| 38 | Abrahamduru/mHC.cu | 🚀 Implement mHC using CUDA for efficient Manifold-Constrained... | 30 | Emerging |
| 39 | mikeroyal/OpenCL-Guide | OpenCL Guide | 30 | Emerging |
| 40 | priteshgohil/CUDA-programming-tutorial | Get started with CUDA programming | 29 | Experimental |
| 41 | SamerMakni/cuda-selector | A simple tool to select the optimal CUDA device based on custom criteria. | 29 | Experimental |
| 42 | lcmialichi/php-cuda-ext | Direct NVIDIA CUDA access for PHP. GPU-accelerated tensors, JIT-compiled... | 28 | Experimental |
| 43 | NumPower/numpower-autograd | High performance PHP tensor with autograd (automatic differentiation) and... | 28 | Experimental |
| 44 | Venkat2811/yali | Speed-of-Light SW efficiency by using ultra low-latency primitives for comms... | 28 | Experimental |
| 45 | gabrielmaialva33/viva_tensor | Pure Gleam tensor library with quantization (INT8, NF4, AWQ), Flash... | 23 | Experimental |
| 46 | mrpottermusic/nccl-mesh-plugin | 🌐 Enable distributed ML with the NCCL Mesh Plugin for efficient... | 22 | Experimental |
| 47 | realies/microgpt.c | Karpathy's microgpt.py, in C | 22 | Experimental |
| 48 | porosh656/cuPDLPx | 🚀 Accelerate your linear programming with cuPDLPx, a GPU-based solver that... | 21 | Experimental |
| 49 | muhamadsafii-21/cutile-learn | 🚀 Learn efficient CUDA programming with cuTile through hands-on tutorials... | 21 | Experimental |
| 50 | LessUp/cuda-kernel-academy | CUDA Kernel Optimization Academy: SGEMM Tutorial, TensorCraft Ops, HPC... | 21 | Experimental |
| 51 | LessUp/hpc-ai-optimization-lab | CUDA HPC Kernel Optimization Textbook: Naive to Tensor Core — GEMM,... | 21 | Experimental |
| 52 | nageshnnazare/cuda-know-hows | cuda related stuff | 20 | Experimental |
| 53 | Duconnor/Pudding | This is the official repository for the project Pudding. Pudding enables you... | 19 | Experimental |
| 54 | ProjectoOfficial/CUDA | Learn cuda step-by-step starting from 0 with these simple and free code... | 18 | Experimental |
| 55 | karton3c/kuda | my custom open-source programming language | 17 | Experimental |
| 56 | Pects1949/Cpp-Distributed-ML-Framework | A C++ framework for distributed machine learning training, focusing on... | 14 | Experimental |
| 57 | aksayush2005/project-compiled | A Mini Machine Learning Compiler with Hardware-Aware Optimization | 14 | Experimental |
| 58 | garrettkinman/SteadyTensor | An ultra-light, ultra-flexible tensor library written in pure Nim. Intended... | 13 | Experimental |
| 59 | rurumimic/cuda | compute unified device architecture | 13 | Experimental |
| 60 | camarababa/cuda-mastery-guide | 🚀 Master CUDA programming with structured lessons covering fundamentals,... | 13 | Experimental |
| 61 | takielias/php-tensor-for-windows | PHP Tensor Extension for Windows https://github.com/RubixML/Tensor | 13 | Experimental |
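The tier labels in the listing appear to follow fixed score bands. The sketch below is an inference from the 61 rows above, not a documented rule: every score of 71 or more is Verified, 50 to 69 is Established, 30 to 49 is Emerging, and 29 or less is Experimental. No project scores exactly 70, so placing 70 in the Verified band is an assumption. The function name tier_for_score is hypothetical; the SCORES list is copied from the listing in rank order.

```python
from collections import Counter

# Scores copied from the listing above, in rank order (61 entries).
SCORES = [
    73, 73, 71, 69, 67, 64, 64, 61, 61, 60, 60, 55, 53, 50,
    49, 47, 47, 46, 45, 44, 44, 44, 43, 42, 42, 40, 40, 40, 39,
    38, 37, 37, 36, 34, 34, 32, 31, 30, 30,
    29, 29, 28, 28, 28, 23, 22, 22, 21, 21, 21, 21, 20, 19, 18,
    17, 14, 14, 13, 13, 13, 13,
]


def tier_for_score(score):
    """Map a 0-100 score to a tier label.

    Thresholds are inferred from the listing, not taken from any
    documentation. Treating exactly 70 as Verified is an assumption.
    """
    if score >= 70:
        return "Verified"
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


# Number of projects per tier under the inferred thresholds.
tier_counts = Counter(tier_for_score(s) for s in SCORES)
```

Under these inferred bands, tier_counts reproduces the labels shown in the listing: 3 Verified, 11 Established, 25 Emerging, and 22 Experimental.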

Comparisons in this category