BaizeAI/kcover
🧯 Kubernetes coverage for fault awareness and recovery. Works for any LLMOps, MLOps, or AI workload.
This tool helps Site Reliability Engineers (SREs), DevOps engineers, and Machine Learning Operations (MLOps) specialists keep large-scale AI applications running smoothly. It reads information about your AI jobs from your Kubernetes cluster, automatically detects issues such as hardware or network failures, and recovers the affected jobs without manual intervention. The goal is continuous training and service availability for your critical AI workloads.
Use this if you manage AI workloads running on Kubernetes and need automated detection and recovery from failures to minimize downtime and ensure continuous operation.
Not ideal if you are running small, non-critical AI applications or do not use Kubernetes for your deployments.
Stars: 35
Forks: 3
Language: Go
License: Apache-2.0
Category:
Last pushed: Mar 08, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/mlops/BaizeAI/kcover"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
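For callers who would rather make the same request from Go (the repository's own language), the sketch below mirrors the curl command above. It assumes the URL follows a `category/owner/repo` pattern, which is inferred from that single example rather than documented here, and it prints the raw response body because the response schema is not described on this page. The helper names `qualityURL` and `fetchQuality` are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

// baseURL is copied from the curl command on this page.
const baseURL = "https://pt-edge.onrender.com/api/v1/quality"

// qualityURL builds the endpoint for one repository.
// Assumption: the path is category/owner/repo, inferred from the one example URL.
func qualityURL(category, owner, repo string) string {
	return fmt.Sprintf("%s/%s/%s/%s", baseURL,
		url.PathEscape(category), url.PathEscape(owner), url.PathEscape(repo))
}

// fetchQuality issues a GET request and returns the raw response body.
// The body is not decoded because its schema is not documented here.
func fetchQuality(ctx context.Context, client *http.Client, endpoint string) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Cap the read at 1 MiB so an unexpected response cannot exhaust memory.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s: %s", resp.Status, body)
	}
	return body, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	body, err := fetchQuality(ctx, http.DefaultClient, qualityURL("mlops", "BaizeAI", "kcover"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		return
	}
	fmt.Println(string(body))
}
```

No API key is sent, so this falls under the keyless daily limit; how a key would be supplied is not stated on this page, so the sketch leaves it out.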
Higher-rated alternatives
weibaohui/k8m
A lightweight, cross-platform Mini Kubernetes AI...
cloud-barista/cb-tumblebug
Cloud-Barista Multi-Cloud Infra Management Framework
kubesphere/kubesphere
The container platform tailored for Kubernetes multi-cloud, datacenter, and edge management ⎈ 🖥 ☁️
yuandrk/homelabops
GitOps homelab infrastructure with K3s, FluxCD, Terraform, and Ansible. Features multi-arch...
lenaxia/k8s-mechanic
A K8s controller that watches your cluster for failures and opens pull requests on your GitOps...