alepot55/agentrial
Statistical evaluation framework for AI agents
AI agents built on large language models can work perfectly one day and fail the next. This project makes agent testing reliable by running each agent multiple times on specific tasks and reporting statistical confidence intervals for performance instead of single-run anecdotes. It tells you not just whether an agent passed or failed, but where and why it failed, how much it cost, and how consistent it was, helping developers understand and improve their agents' real-world reliability.
Available on PyPI.
Use this if you need to rigorously evaluate the reliability, consistency, cost, and latency of your AI agents and want clear statistical insight into their performance despite the inherent run-to-run variability of large language models.
Not ideal if you only need a quick, single-run check of an agent's functionality, or if your agents do not exhibit variable behavior across runs.
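For intuition, here is a minimal sketch of the idea in plain Python. It is not agentrial's actual API; the agent callable and its result fields are hypothetical. The point is simply to run the same task repeatedly, record pass/fail, cost, and latency per trial, and report a confidence interval for the pass rate rather than a single result.

import math
import statistics

def evaluate(agent, task, trials=20, z=1.96):
    # Run the same task repeatedly and collect per-trial outcomes.
    # Each result is assumed to be a dict like {"passed": bool, "cost": float, "latency": float}.
    results = [agent(task) for _ in range(trials)]
    passes = sum(r["passed"] for r in results)
    p = passes / trials
    # Normal-approximation 95% confidence interval for the pass rate.
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return {
        "pass_rate": p,
        "ci_95": (max(0.0, p - half_width), min(1.0, p + half_width)),
        "mean_cost": statistics.mean(r["cost"] for r in results),
        "mean_latency": statistics.mean(r["latency"] for r in results),
    }

A real framework would add per-step failure attribution and smarter interval estimates (e.g. Wilson intervals for small trial counts), but the multi-trial structure is the core.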
Stars
15
Forks
2
Language
Python
License
MIT
Category
Last pushed
Feb 06, 2026
Commits (30d)
0
Dependencies
7
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/alepot55/agentrial"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
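If you prefer Python over curl, the same endpoint can be queried with the standard library. This is a sketch under the assumption that the endpoint returns JSON; the response fields are not shown here.

import json
import urllib.request

url = "https://pt-edge.onrender.com/api/v1/quality/agents/alepot55/agentrial"
with urllib.request.urlopen(url) as resp:   # anonymous access: 100 requests/day
    data = json.load(resp)
print(json.dumps(data, indent=2))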
Higher-rated alternatives
StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and...
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
agentscope-ai/OpenJudge
OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards