uccl-project/uccl
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP, i.e., expert parallelism (e.g., GPU-driven communication)
This project substantially speeds up communication between GPUs in demanding AI and machine learning workloads. It plugs into existing machine learning code that uses standard communication libraries such as NCCL or RCCL and moves data between GPUs faster, reducing training time. Data scientists, machine learning engineers, and AI researchers working with large-scale GPU clusters can use it to accelerate model training and other distributed computations.
1,234 stars. Actively maintained with 58 commits in the last 30 days.
Use this if you are running large-scale distributed machine learning workloads on multiple GPUs and need to drastically improve data transfer speed and overall training efficiency.
Not ideal if your machine learning tasks run on a single GPU or if you are not experiencing communication bottlenecks across your GPU cluster.
Stars: 1,234
Forks: 128
Language: C++
License: Apache-2.0
Category:
Last pushed: Mar 13, 2026
Commits (30d): 58
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/uccl-project/uccl"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
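The curl command above can also be issued from Python. A minimal sketch, assuming only the endpoint pattern shown in the curl example; the JSON response schema is not documented here, so the code prints whatever the server returns rather than assuming specific fields:

```python
import json
import urllib.request

# Endpoint pattern taken from the curl example above (assumed stable).
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository endpoint URL shown in the curl example."""
    return f"{API_BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON payload (response schema is assumed, not verified)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

# Print the request URL for this repository; call fetch_quality() to
# actually hit the API (subject to the 100 requests/day limit).
print(quality_url("uccl-project", "uccl"))
```

No API key is needed at the free tier, so there is no auth header in the sketch; a keyed request would presumably add one, but the header name is not documented here.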
Related tools
vllm-project/vllm-ascend
Community maintained hardware plugin for vLLM on Ascend
SemiAnalysisAI/InferenceX
Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS - GB200 NVL72 vs MI355X...
kvcache-ai/Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
sophgo/tpu-mlir
Machine learning compiler based on MLIR for Sophgo TPU.
BBuf/how-to-optim-algorithm-in-cuda
how to optimize some algorithm in cuda.