martinimarcello00/SRE-agent
Autonomous agent for Kubernetes incident detection, diagnosis, and mitigation using LLMs and modular workflows. Integrates LangChain, LangGraph, and MCP servers to enable automated SRE tasks in cloud-native environments.
This system helps Site Reliability Engineers (SREs) and operations teams detect, diagnose, and resolve issues in Kubernetes environments. It ingests operational data from observability tools such as Prometheus and Jaeger, then identifies problems and their root causes within microservice architectures. The output is a detailed diagnosis and a proposed mitigation strategy for complex incidents.
Use this if you need to significantly reduce the time it takes to understand and fix complex incidents in your Kubernetes clusters, especially in dynamic microservice environments.
Not ideal if your systems are not deployed on Kubernetes or if you prefer a fully manual, human-driven incident response process.
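The detect, diagnose, and mitigate flow described above can be pictured with a minimal sketch. Everything here is an assumption for illustration: the `Signal` and `Incident` types, the thresholds, and the fixed rules are hypothetical and are not taken from the repository, which by its description uses LLMs with LangChain, LangGraph, and MCP servers for these steps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Signal:
    """Hypothetical per-service observation (not the project's data model)."""
    service: str
    error_rate: float       # fraction of failed requests, e.g. from metrics
    p99_latency_ms: float   # tail latency, e.g. derived from traces


@dataclass
class Incident:
    """Hypothetical incident record carried through the stages."""
    service: str
    symptoms: List[str] = field(default_factory=list)
    root_cause: str = ""
    mitigation: str = ""


def detect(signals: List[Signal],
           max_error_rate: float = 0.05,
           max_latency_ms: float = 500.0) -> List[Incident]:
    """Flag services whose signals cross the (assumed) thresholds."""
    incidents = []
    for s in signals:
        symptoms = []
        if s.error_rate > max_error_rate:
            symptoms.append("high error rate")
        if s.p99_latency_ms > max_latency_ms:
            symptoms.append("high latency")
        if symptoms:
            incidents.append(Incident(service=s.service, symptoms=symptoms))
    return incidents


def diagnose(incident: Incident) -> Incident:
    """Placeholder rule standing in for an LLM-driven diagnosis step."""
    if "high error rate" in incident.symptoms:
        incident.root_cause = "failing requests (placeholder hypothesis)"
    else:
        incident.root_cause = "slow dependency (placeholder hypothesis)"
    return incident


def propose_mitigation(incident: Incident) -> Incident:
    """Placeholder mitigation; a real agent would tailor this to the cause."""
    incident.mitigation = (
        f"review recent changes to {incident.service} (placeholder)"
    )
    return incident


def run_pipeline(signals: List[Signal]) -> List[Incident]:
    """Chain the three stages: detect, then diagnose, then mitigate."""
    return [propose_mitigation(diagnose(i)) for i in detect(signals)]
```

Calling `run_pipeline` with a list of `Signal` objects returns one `Incident` per service that crossed a threshold, each carrying a diagnosis and a proposed mitigation.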
Stars: 9
Forks: 2
Language: Jupyter Notebook
License: —
Category:
Last pushed: Jan 30, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/martinimarcello00/SRE-agent"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
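The same request can be made from Python using only the standard library. This is a minimal sketch: the endpoint path comes from the curl command above, while the helper names `build_url` and `fetch_quality` are hypothetical, and the assumption that the response body is JSON is not confirmed by this page.

```python
import json
import urllib.request

# Base path taken from the curl example; owner/repo are appended to it.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/rag"


def build_url(owner: str, repo: str) -> str:
    """Build the per-repository endpoint URL."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for a repository.

    Assumes the endpoint returns JSON; adjust the parsing if it does not.
    """
    with urllib.request.urlopen(build_url(owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (performs a network request, so it is left commented out):
# data = fetch_quality("martinimarcello00", "SRE-agent")
```

With the keyless limit of 100 requests/day, it is worth caching the result locally rather than calling `fetch_quality` repeatedly.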
Higher-rated alternatives
dynamiq-ai/dynamiq
Dynamiq is an orchestration framework for agentic AI and LLM applications
eosphoros-ai/DB-GPT
AI Native Data App Development framework with AWEL(Agentic Workflow Expression Language) and Agents
BidingCC/BuildingAI
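The WordPress of the AI era: the Eastern Hemisphere's first building-block-style system for assembling AI applications. Anyone can build their own AI application system for free, for example enterprise agent systems, AI comic-drama systems, AI academic-paper systems, AI customer-service systems...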
dataelement/bisheng
BISHENG is an open LLM devops platform for next generation Enterprise AI applications. Powerful...
potpie-ai/potpie
Spec-driven development for large codebases