Venkat2811/yali
Speed-of-Light SW efficiency by using ultra low-latency primitives for comms collectives
This project provides an optimized library for speeding up data exchange between two NVLink-connected NVIDIA GPUs. It takes arrays of numerical data residing on the two GPUs, combines them element-wise, and broadcasts the result back to both (an AllReduce). High-performance computing engineers and researchers working with GPU-accelerated workloads will find it useful for cutting the time spent in collective communication operations.
Use this if you are running computationally intensive tasks that involve frequent data aggregation (like "AllReduce" operations) between exactly two NVLink-connected NVIDIA GPUs and you need faster communication with more consistent performance than standard libraries.
Not ideal if your setup involves more than two GPUs, if your GPUs are not connected via NVLink, or if you need to perform collective operations across multiple compute nodes.
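For readers unfamiliar with the operation the description refers to, the semantics of a two-rank AllReduce can be sketched on the CPU. This is an illustrative model only, not yali's actual API; the function name and the sum reduction are assumptions for the sketch:

```python
def allreduce_sum(rank0_buf, rank1_buf):
    """Model a 2-rank AllReduce with a sum reduction:
    combine both buffers element-wise, then give every
    rank an identical copy of the reduced result."""
    reduced = [a + b for a, b in zip(rank0_buf, rank1_buf)]
    # After the collective completes, both ranks hold the same data.
    return list(reduced), list(reduced)

r0, r1 = allreduce_sum([1.0, 2.0], [3.0, 4.0])
# both ranks now hold [4.0, 6.0]
```

Libraries such as yali aim to perform exactly this exchange directly over NVLink, so the reduction and broadcast happen without staging data through host memory.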
Stars: 13
Forks: —
Language: Cuda
License: MIT
Category: —
Last pushed: Jan 22, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/Venkat2811/yali"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000/day.
Higher-rated alternatives
brucefan1983/GPUMD: Graphics Processing Units Molecular Dynamics
iree-org/iree: A retargetable MLIR-based machine learning compiler and runtime toolkit.
uxlfoundation/oneDAL: oneAPI Data Analytics Library (oneDAL)
rapidsai/cuml: cuML - RAPIDS Machine Learning Library
NVIDIA/cutlass: CUDA Templates and Python DSLs for High-Performance Linear Algebra