sehoffmann/dmlcloud
Painless distributed training with torch
This library helps machine learning engineers and researchers scale up deep learning training on high-performance computing (HPC) clusters. It takes standard PyTorch training scripts and makes it easy to distribute the workload across multiple GPUs and nodes, so complex models train faster and experiments and deployments turn around more quickly.
Available on PyPI.
Use this if you are a machine learning engineer or researcher using PyTorch and need to train large models efficiently across multiple GPUs or machines in an HPC environment like a Slurm cluster.
Not ideal if you are looking for a high-level deep learning framework that abstracts away most of the PyTorch code, or if you only train models on a single GPU.
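dmlcloud's own API is not shown on this page, so the sketch below only illustrates the general idea behind the data-parallel training it provides: each worker computes a gradient on its own data shard, the gradients are averaged (an "all-reduce"), and every worker applies the same update. It uses only the Python standard library, and every name in it is hypothetical rather than part of dmlcloud or PyTorch:

```python
from multiprocessing import Pool

def local_gradient(args):
    # Each "worker" computes a gradient on its own data shard.
    # Here: gradient of mean squared error for the model y = w * x
    # against the target y = 2 * x, so the true weight is 2.0.
    shard, w = args
    return sum(2 * (w * x - 2 * x) * x for x in shard) / len(shard)

def allreduce_mean(grads):
    # All-reduce step: every worker ends up with the average gradient.
    return sum(grads) / len(grads)

def train_step(shards, w, lr=0.05):
    # One synchronous data-parallel step across len(shards) workers.
    with Pool(len(shards)) as pool:
        grads = pool.map(local_gradient, [(s, w) for s in shards])
    return w - lr * allreduce_mean(grads)

if __name__ == "__main__":
    shards = [[1.0, 2.0], [3.0, 4.0]]  # data split across two workers
    w = 0.0
    for _ in range(50):
        w = train_step(shards, w)
    print(round(w, 2))  # converges toward the true weight 2.0
```

In real multi-node training the worker processes live on different machines and the all-reduce runs over a backend such as NCCL or MPI; libraries like dmlcloud exist to hide exactly that plumbing.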
Stars: 12
Forks: 1
Language: Python
License: BSD-3-Clause
Category: ML frameworks
Last pushed: Mar 19, 2026
Commits (30d): 0
Dependencies: 7
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/sehoffmann/dmlcloud"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
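The same request can be made from Python with the standard library alone. The endpoint URL is taken from the curl command above; the shape of the JSON response is not documented on this page, so inspect it rather than assuming a schema:

```python
import json
from urllib.request import urlopen

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    # Build the endpoint URL for one repository.
    return f"{API_BASE}/{category}/{owner}/{repo}"

def fetch_quality(category: str, owner: str, repo: str) -> dict:
    # Hits the live endpoint (no key needed up to 100 requests/day);
    # the response fields are undocumented here, so print and inspect.
    with urlopen(quality_url(category, owner, repo)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(quality_url("ml-frameworks", "sehoffmann", "dmlcloud"))
```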
Higher-rated alternatives
deepspeedai/DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference...
helmholtz-analytics/heat
Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
hpcaitech/ColossalAI
Making large AI models cheaper, faster and more accessible
horovod/horovod
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
bsc-wdc/dislib
The Distributed Computing library for python implemented using PyCOMPSs programming model for HPC.