Venkat2811/yali
Speed-of-Light SW efficiency by using ultra low-latency primitives for comms collectives
This project provides an optimized library for speeding up data exchange between two NVLink-connected NVIDIA GPUs. It takes arrays of numerical data residing on the two GPUs, combines them element-wise, and broadcasts the result back to both (an AllReduce). High-performance computing engineers and researchers working with GPU-accelerated workloads will find it useful for cutting the time spent in collective communication operations.
Use this if you are running computationally intensive tasks that involve frequent data aggregation (like "AllReduce" operations) between exactly two NVLink-connected NVIDIA GPUs and you need faster communication with more consistent performance than standard libraries.
Not ideal if your setup involves more than two GPUs, if your GPUs are not connected via NVLink, or if you need to perform collective operations across multiple compute nodes.
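For readers unfamiliar with the operation the description refers to, the semantics of a two-rank AllReduce can be sketched on the CPU. This is an illustrative model only, not yali's actual API; the function name and the sum reduction are assumptions for the sketch:

```python
def allreduce_sum(rank0_buf, rank1_buf):
    """Model a 2-rank AllReduce with a sum reduction:
    combine both buffers element-wise, then give every
    rank an identical copy of the reduced result."""
    reduced = [a + b for a, b in zip(rank0_buf, rank1_buf)]
    # After the collective completes, both ranks hold the same data.
    return list(reduced), list(reduced)

r0, r1 = allreduce_sum([1.0, 2.0], [3.0, 4.0])
# both ranks now hold [4.0, 6.0]
```

Libraries such as yali aim to perform exactly this exchange directly over NVLink, so the reduction and broadcast happen without staging data through host memory.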
Stars: 13
Forks: —
Language: Cuda
License: MIT
Category: —
Last pushed: Jan 22, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/Venkat2811/yali"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000/day.
Higher-rated alternatives
brucefan1983/GPUMD: Graphics Processing Units Molecular Dynamics
iree-org/iree: A retargetable MLIR-based machine learning compiler and runtime toolkit.
uxlfoundation/oneDAL: oneAPI Data Analytics Library (oneDAL)
rapidsai/cuml: cuML - RAPIDS Machine Learning Library
NVIDIA/cutlass: CUDA Templates and Python DSLs for High-Performance Linear Algebra