NVIDIA/cutlass
CUDA Templates and Python DSLs for High-Performance Linear Algebra
This project provides CUDA C++ templates and Python DSLs for building highly optimized linear algebra kernels, particularly general matrix-matrix multiplication (GEMM), on NVIDIA GPUs. Given a computational definition and the data types involved, it generates high-performance CUDA kernels. Researchers, performance engineers, and students working on GPU programming for numerical applications will find it useful.
9,426 stars. Actively maintained with 10 commits in the last 30 days.
Use this if you need to develop custom, extremely fast GPU kernels for linear algebra, especially matrix multiplications, using a more accessible Python interface or traditional C++ templates.
Not ideal if you are an end-user simply looking to run existing machine learning models or use standard data science libraries without writing custom GPU code.
Stars
9,426
Forks
1,725
Language
C++
License
—
Category
Last pushed
Mar 12, 2026
Commits (30d)
10
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/NVIDIA/cutlass"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
Recent Releases
Related frameworks
iree-org/iree
A retargetable MLIR-based machine learning compiler and runtime toolkit.
brucefan1983/GPUMD
Graphics Processing Units Molecular Dynamics
uxlfoundation/oneDAL
oneAPI Data Analytics Library (oneDAL)
rapidsai/cuml
cuML - RAPIDS Machine Learning Library
ROCm/Tensile
[DEPRECATED] Moved to ROCm/rocm-libraries repo