BaizeAI/kcover

🧯 Kubernetes coverage for fault awareness and recovery; works for any LLMOps, MLOps, or AI workload.

41 / 100 (Emerging)

This tool helps Site Reliability Engineers (SREs), DevOps engineers, and Machine Learning Operations (MLOps) specialists keep large-scale AI applications running. It reads information about your AI jobs from your Kubernetes cluster, automatically detects faults such as hardware or network failures, and recovers the affected jobs without manual intervention, so that training runs continue and services stay available.

Use this if you manage AI workloads running on Kubernetes and need automated detection and recovery from failures to minimize downtime and ensure continuous operation.

Not ideal if you are running small, non-critical AI applications or do not use Kubernetes for your deployments.

AI-operations MLOps Site-Reliability Kubernetes-management fault-tolerance
No Package · No Dependents
Maintenance 10 / 25
Adoption 7 / 25
Maturity 16 / 25
Community 8 / 25

Stars: 35
Forks: 3
Language: Go
License: Apache-2.0
Last pushed: Mar 08, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/mlops/BaizeAI/kcover"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
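The same request can be made from a script. The following is a minimal sketch using only the Python standard library. It assumes the endpoint returns JSON and that the path follows the pattern `<category>/<owner>/<repo>` seen in the curl command above; neither is confirmed here, and the helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL for a category and an "owner/name" repo.

    Assumption: other repositories follow the same path pattern as the
    single example URL shown on this page.
    """
    # Keep the slash between owner and name; escape anything unusual.
    return f"{BASE_URL}/{quote(category, safe='')}/{quote(repo, safe='/')}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record and parse it.

    Assumption: the response body is JSON. Its field names are not
    documented on this page, so inspect the result before relying on it.
    """
    request = urllib.request.Request(
        quality_url(category, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("mlops", "BaizeAI/kcover")` issues the same request as the curl command. Without a key, the 100 requests/day limit applies, so cache the result rather than polling.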