NVIDIA/cutlass
CUDA Templates and Python DSLs for High-Performance Linear Algebra
This project provides CUDA C++ templates and Python DSLs for building highly optimized linear algebra kernels, particularly general matrix-matrix multiplication (GEMM), on NVIDIA GPUs. Given a computational definition and the data types involved, it generates high-performance CUDA kernels. Researchers, performance engineers, and students working on GPU programming for numerical applications will find it useful.
9,426 stars. Actively maintained with 10 commits in the last 30 days.
Use this if you need to develop custom, extremely fast GPU kernels for linear algebra, especially matrix multiplications, using a more accessible Python interface or traditional C++ templates.
Not ideal if you are an end-user simply looking to run existing machine learning models or use standard data science libraries without writing custom GPU code.
Stars
9,426
Forks
1,725
Language
C++
License
—
Category
Last pushed
Mar 12, 2026
Commits (30d)
10
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/NVIDIA/cutlass"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
Recent Releases
Related frameworks
iree-org/iree
A retargetable MLIR-based machine learning compiler and runtime toolkit.
brucefan1983/GPUMD
Graphics Processing Units Molecular Dynamics
uxlfoundation/oneDAL
oneAPI Data Analytics Library (oneDAL)
rapidsai/cuml
cuML - RAPIDS Machine Learning Library
ROCm/Tensile
[DEPRECATED] Moved to ROCm/rocm-libraries repo