alibaba/TePDist
TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models.
TePDist automatically distributes the training of large deep learning models across multiple GPUs or machines. Given a model's computational graph in XLA HLO format, it searches for an efficient way to split the work, then manages the distributed training. Data scientists and AI researchers training massive neural networks can use it to speed up training without hand-crafting a parallelization strategy.
No commits in the last 6 months.
Use this if you need to train very large deep learning models efficiently across multiple GPUs or servers without manually configuring complex parallelization strategies.
Not ideal if you are working with small models that train quickly on a single GPU or if you prefer to manually control every aspect of your distributed training setup.
Stars
98
Forks
10
Language
C++
License
Apache-2.0
Category
ML Frameworks
Last pushed
Apr 22, 2023
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/alibaba/TePDist"
Open to everyone: 100 requests/day with no API key. Get a free key for 1,000 requests/day.
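The same endpoint can be called from Python. A minimal sketch, assuming only the URL shown above; the helper name and any JSON field names are illustrative, since the response schema is not documented here:

```python
import json
import urllib.request


def quality_url(category: str, owner: str, repo: str) -> str:
    # Build the quality-API URL from the path pattern shown above.
    return f"https://pt-edge.onrender.com/api/v1/quality/{category}/{owner}/{repo}"


url = quality_url("ml-frameworks", "alibaba", "TePDist")
print(url)

# To actually fetch (requires network; free tier allows 100 requests/day):
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)  # response fields are not documented here
```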
Higher-rated alternatives
deepspeedai/DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference...
helmholtz-analytics/heat
Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
hpcaitech/ColossalAI
Making large AI models cheaper, faster and more accessible
horovod/horovod
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
bsc-wdc/dislib
The Distributed Computing library for python implemented using PyCOMPSs programming model for HPC.