reacher-z/gpu-monitor
Lightweight NVIDIA GPU monitor that sends alerts via Slack, Discord, Telegram, and more (20 channels) when a training job crashes or a GPU overheats. Zero dependencies. Single file.
This tool helps machine learning engineers and researchers avoid wasted GPU compute time. It monitors NVIDIA GPUs for common problems such as job crashes, overheating, memory leaks, and idle GPUs, then immediately sends an alert to your preferred communication channel (for example, Slack or Discord). You can then fix the problem and resume work right away, rather than discovering it hours later.
138 stars.
Use this if you run GPU-intensive machine learning training jobs and want instant notifications about critical issues that could waste time or damage hardware.
Not ideal if you only need a manual, occasional check of GPU status or if you already have a comprehensive monitoring system like Prometheus and Grafana fully integrated.
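The kind of check described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code used by gpu-monitor: the names `GpuSample`, `parse_samples`, and `find_problems` and all threshold values are hypothetical. It only assumes that `nvidia-smi` is installed and uses its `--query-gpu` CSV output. A high memory ratio stands in for leak detection here, since a real leak check would need to compare readings over time, and crash detection is not covered.

```python
import subprocess
from dataclasses import dataclass

# Field names accepted by `nvidia-smi --query-gpu`.
QUERY = "index,temperature.gpu,utilization.gpu,memory.used,memory.total"


@dataclass
class GpuSample:
    index: int
    temp_c: int
    util_pct: int
    mem_used_mib: int
    mem_total_mib: int


def read_nvidia_smi() -> str:
    """Return raw CSV rows from nvidia-smi (needs an NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_samples(csv_text: str) -> list:
    """Parse one GpuSample per CSV row.

    Assumes every field is numeric; some GPUs report "[N/A]" for a field,
    which this sketch does not handle.
    """
    samples = []
    for line in csv_text.strip().splitlines():
        fields = [int(part.strip()) for part in line.split(",")]
        samples.append(GpuSample(*fields))
    return samples


def find_problems(samples, max_temp_c=85, idle_util_pct=5, mem_full_ratio=0.95):
    """Return human-readable alerts; thresholds are illustrative defaults."""
    problems = []
    for s in samples:
        if s.temp_c >= max_temp_c:
            problems.append(f"GPU {s.index}: overheating ({s.temp_c} C)")
        if s.util_pct <= idle_util_pct:
            problems.append(f"GPU {s.index}: idle ({s.util_pct}% utilization)")
        if s.mem_used_mib / s.mem_total_mib >= mem_full_ratio:
            problems.append(
                f"GPU {s.index}: memory nearly full "
                f"({s.mem_used_mib}/{s.mem_total_mib} MiB)"
            )
    return problems
```

On a machine with an NVIDIA driver, `find_problems(parse_samples(read_nvidia_smi()))` returns the list of alerts to forward to a chat webhook. Keeping parsing and thresholding separate from the subprocess call lets the logic be tested on a machine without a GPU.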
Stars: 138
Forks: 16
Language: Python
License: MIT
Category: (none shown)
Last pushed: Mar 08, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/reacher-z/gpu-monitor"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
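The same request can be made from Python using only the standard library. This is a sketch under two assumptions that the listing does not confirm: that the endpoint returns a JSON body, and that the path follows a category/owner/repo pattern, which is inferred from the single URL shown above. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (path pattern inferred from the curl example)."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch and decode the response, assuming the body is JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `fetch_quality("ml-frameworks", "reacher-z", "gpu-monitor")` requests the same URL as the curl command. Unauthenticated calls count against the 100 requests/day limit.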
Higher-rated alternatives
modelscope/modelscope
ModelScope: bring the notion of Model-as-a-Service to life.
basetenlabs/truss
The simplest way to serve AI/ML models in production
Lightning-AI/LitServe
A minimal Python framework for building custom AI inference servers with full control over...
deepjavalibrary/djl-serving
A universal scalable machine learning model deployment solution
tensorflow/serving
A flexible, high-performance serving system for machine learning models