NVIDIA/nccl
Optimized primitives for collective multi-GPU communication
NCCL (NVIDIA Collective Communications Library) helps high-performance computing developers move data efficiently between multiple GPUs. By optimizing collective communication routines such as all-reduce and broadcast, it speeds up training of large models and processing of large datasets. The input is data distributed across several GPUs; the output is that same data aggregated or redistributed efficiently. It is aimed at engineers building deep learning frameworks or scientific simulation software.
4,521 stars. Actively maintained with 1 commit in the last 30 days.
Use this if you are a system architect or developer building applications that require fast, collective data transfers between multiple GPUs, either within a single server or across a cluster.
Not ideal if you are a data scientist primarily using high-level deep learning frameworks without needing to optimize low-level GPU communication primitives.
Stars: 4,521
Forks: 1,158
Language: C++
License: —
Category: —
Last pushed: Mar 08, 2026
Commits (30d): 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/NVIDIA/nccl"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
Related frameworks
iree-org/iree
A retargetable MLIR-based machine learning compiler and runtime toolkit.
brucefan1983/GPUMD
Graphics Processing Units Molecular Dynamics
uxlfoundation/oneDAL
oneAPI Data Analytics Library (oneDAL)
rapidsai/cuml
cuML - RAPIDS Machine Learning Library
NVIDIA/cutlass
CUDA Templates and Python DSLs for High-Performance Linear Algebra