uccl-project/uccl
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP, i.e., expert parallelism (e.g., GPU-driven communication)
This project substantially speeds up communication between GPUs in demanding AI and machine learning workloads. It plugs into existing machine learning code that uses standard communication libraries such as NCCL or RCCL and moves data between GPUs faster, reducing training time. Data scientists, machine learning engineers, and AI researchers working with large-scale GPU clusters can use it to accelerate model training and other distributed computations.
1,234 stars. Actively maintained with 58 commits in the last 30 days.
Use this if you are running large-scale distributed machine learning workloads on multiple GPUs and need to drastically improve data transfer speed and overall training efficiency.
Not ideal if your machine learning tasks run on a single GPU or if you are not experiencing communication bottlenecks across your GPU cluster.
Stars: 1,234
Forks: 128
Language: C++
License: Apache-2.0
Category:
Last pushed: Mar 13, 2026
Commits (30d): 58
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/uccl-project/uccl"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
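The curl command above can also be issued from Python. A minimal sketch, assuming only the endpoint pattern shown in the curl example; the JSON response schema is not documented here, so the code prints whatever the server returns rather than assuming specific fields:

```python
import json
import urllib.request

# Endpoint pattern taken from the curl example above (assumed stable).
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository endpoint URL shown in the curl example."""
    return f"{API_BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch and decode the JSON payload (response schema is assumed, not verified)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

# Print the request URL for this repository; call fetch_quality() to
# actually hit the API (subject to the 100 requests/day limit).
print(quality_url("uccl-project", "uccl"))
```

No API key is needed at the free tier, so there is no auth header in the sketch; a keyed request would presumably add one, but the header name is not documented here.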
Related tools
vllm-project/vllm-ascend
Community maintained hardware plugin for vLLM on Ascend
SemiAnalysisAI/InferenceX
Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS - GB200 NVL72 vs MI355X...
kvcache-ai/Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
sophgo/tpu-mlir
Machine learning compiler based on MLIR for Sophgo TPU.
BBuf/how-to-optim-algorithm-in-cuda
how to optimize some algorithm in cuda.