hidai25/eval-view
Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.
This project helps developers prevent unexpected changes in their AI agents' behavior. It snapshots your agent's current behavior (tool calls and output) as a baseline, then compares new runs against that baseline and highlights any differences in the tools used or the output generated. It's for individual developers, startups, and small AI teams who build and maintain AI agents and want consistent, reliable operation.
Use this if you need to automatically catch silent regressions in your AI agent's behavior, like a change in tool selection or output quality, before users encounter them.
Not ideal if you are looking for general AI agent observability or a broad platform for comparing different prompts or models.
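The snapshot-and-diff workflow described above can be sketched in a few lines. This is an illustrative outline only, not eval-view's actual API: the run_agent function, the baseline file path, and the snapshot fields ("tool_calls", "output") are hypothetical placeholders.

# Illustrative sketch of snapshot/diff regression testing for an agent run.
# NOT eval-view's API; run_agent(), the baseline path, and the snapshot
# fields ("tool_calls", "output") are hypothetical placeholders.
import json
import sys
from pathlib import Path

BASELINE = Path("snapshots/support_agent.json")  # assumed location

def run_agent(prompt: str) -> dict:
    """Hypothetical agent call; returns the tools used and the final output."""
    raise NotImplementedError("wire this to your agent framework")

def main() -> int:
    current = run_agent("Refund order #1234")
    if not BASELINE.exists():
        # First run: record current behavior as the baseline snapshot.
        BASELINE.parent.mkdir(parents=True, exist_ok=True)
        BASELINE.write_text(json.dumps(current, indent=2))
        print("Baseline recorded.")
        return 0
    baseline = json.loads(BASELINE.read_text())
    # Diff the tool-call sequence and the output; any change fails CI.
    if baseline.get("tool_calls") != current.get("tool_calls"):
        print("Tool-call regression:", baseline.get("tool_calls"), "->", current.get("tool_calls"))
        return 1
    if baseline.get("output") != current.get("output"):
        print("Output regression detected.")
        return 1
    print("No regressions.")
    return 0

if __name__ == "__main__":
    sys.exit(main())

In CI, a non-zero exit code from a script like this marks the build as failed, which is how a silent behavior change gets surfaced before users see it.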
Stars: 63
Forks: 16
Language: Python
License: Apache-2.0
Last pushed: Mar 13, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/hidai25/eval-view"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
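The same data can also be fetched from a script. A minimal Python sketch using the requests library is shown below; the response fields are not documented here, so the JSON is simply printed for inspection.

# Minimal sketch: fetch the quality data for this repo from the public API.
# Assumes only what the curl example shows; the response schema is not
# documented here, so the parsed JSON is printed as-is.
import requests

url = "https://pt-edge.onrender.com/api/v1/quality/agents/hidai25/eval-view"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
print(resp.json())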
Higher-rated alternatives
StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and...
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
RouteWorks/RouterArena
RouterArena: An open framework for evaluating LLM routers with standardized datasets, metrics,...