HPC Cluster Management ML Frameworks

Resources, guides, and tools for setting up, configuring, and managing HPC clusters and distributed computing infrastructure for ML workloads. Does NOT include general cloud computing platforms, containerization tools, or ML frameworks themselves.

There are 33 HPC cluster management frameworks tracked. One scores above 70 (Verified tier). The highest-rated is qualcomm/ai-hub-models at 72/100 with 940 stars. One of the top 10 is actively maintained.

Get the projects as JSON (the example request sets limit=20)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=hpc-cluster-management&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
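The same request can be made from a script. The following is a minimal Python sketch using only the standard library; the helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the JSON response is not documented here, so the sketch returns the decoded payload without assuming any field names.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the query URL with the same parameters as the curl example."""
    params = {
        "domain": "ml-frameworks",
        "subcategory": "hpc-cluster-management",
        "limit": limit,
    }
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def fetch_projects(limit=20, timeout=30):
    """Send the GET request and return the decoded JSON payload.

    The response schema is an unknown here, so the payload is returned
    as-is rather than indexed by assumed field names.
    """
    with urllib.request.urlopen(build_url(limit), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_projects()` performs one request against the daily quota mentioned above. Whether a larger `limit` value is accepted by the API is an assumption that would need to be checked against the service.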

| # | Framework | Description | Score | Tier |
| --- | --- | --- | --- | --- |
| 1 | qualcomm/ai-hub-models | Qualcomm® AI Hub Models is our collection of state-of-the-art machine... | 72 | Verified |
| 2 | petuum/adaptdl | Resource-adaptive cluster scheduler for deep learning training. | 58 | Established |
| 3 | zszazi/Deep-learning-in-cloud | List of Deep Learning Cloud Providers | 56 | Established |
| 4 | lincc-frameworks/hyrax | Hyrax - A low-code framework for rapid experimentation with ML &... | 55 | Established |
| 5 | intel/ai-reference-models | Intel® AI Reference Models: contains Intel optimizations for running deep... | 51 | Established |
| 6 | openhackathons-org/gpubootcamp | This repository consists of GPU bootcamp material for HPC and AI | 51 | Established |
| 7 | HydroRoll-Team/HydroRoll | A cross-platform, multi-task, highly customizable dice-system development framework. | 44 | Emerging |
| 8 | HPCNow/hpcnow-labs | HPCNow! training material and hands-on sessions | 39 | Emerging |
| 9 | pescap/EasyHPC | A practical introduction to High Performance Computing (HPC) | 36 | Emerging |
| 10 | ray-project/ray-acm-workshop-2023 | Scalable/Distributed Computer Vision with Ray | 36 | Emerging |
| 11 | binga/cloud-gpus | This repository contains information about Cloud GPU offerings for Machine... | 36 | Emerging |
| 12 | opencomputeproject/ocp-diag-windtunnel | Building & testing private AI on HPC. | 35 | Emerging |
| 13 | debnsuma/ray-for-developers | A comprehensive hands-on guide to building production-grade distributed... | 35 | Emerging |
| 14 | hkust-hpc-team/hkust-hpc | Handbook for AI / HPC users on HKUST central clusters | 35 | Emerging |
| 15 | knagrecha/hydra | Execution framework for multi-task model parallelism. Enables the training... | 33 | Emerging |
| 16 | onlyrobot/bray | Bray is based on Ray and outperforms Ray in practical distributed... | 30 | Emerging |
| 17 | Roulbac/uv-func | A Python decorator to run functions in isolated virtual environments... | 30 | Emerging |
| 18 | Skyld-Labs/ModelHunter | ModelHunter is a powerful pipeline designed to extract machine learning... | 29 | Experimental |
| 19 | hydra-hoard/hydra | A decentralised application that creates high quality machine learning datasets | 27 | Experimental |
| 20 | uw-mad-dash/shockwave | Artifact for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic... | 27 | Experimental |
| 21 | jonathandinu/spark-ray-data-science | Supporting content (slides and exercises) for the Pearson video series... | 26 | Experimental |
| 22 | parisimaa/NYU-HPC | NYU HPC user instruction | 25 | Experimental |
| 23 | gpu-cli/zerostart | Fast cold starts for GPU Python. Streaming wheel extraction for when large... | 25 | Experimental |
| 24 | breadboardfoundry/GPU-Infrastructure | GPU compute infrastructure for research teams running machine learning experiments. | 19 | Experimental |
| 25 | SupreethRao99/slurmy | Template scripts and notes for using SLURM on Nvidia DGX GPU cluster | 18 | Experimental |
| 26 | alifzl/NeSI-Project-Template | NeSI HPC DL project Scaffolding Template | 17 | Experimental |
| 27 | smirko-dev/machine-learning-rpi | Setup ML for Raspberry Pi | 17 | Experimental |
| 28 | erectbranch/enroot-on-slurm | Examples of using Enroot with Slurm for distributed deep learning | 15 | Experimental |
| 29 | Adhytm/multi-gpu-debug-notes | Debugging and isolating GPU context preemption issues in heterogeneous... | 14 | Experimental |
| 30 | RichardScottOZ/experimenta-ml-kiro | experimenta-ml for kiro-cli | 14 | Experimental |
| 31 | settadev/setta | Streamline Python coding, configuration, UI creation, and onboarding. | 14 | Experimental |
| 32 | Akshay3510/Hydra | 🔍 Develop advanced knowledge compilers and #SAT solvers with Hydra, a robust... | 13 | Experimental |
| 33 | drai-inn/uoa-ai-gpu-docs | University of Auckland AI GPU cluster support pages | 12 | Experimental |