alepot55/agentrial

Statistical evaluation framework for AI agents

Score: 46 / 100 (Emerging)

When you're building or deploying AI agents that use large language models, it can be frustrating when they work perfectly one day and fail the next. This project helps you test your AI agents reliably by running them multiple times on specific tasks and reporting statistical confidence intervals for performance instead of single-run anecdotes. It tells you not just whether an agent passed or failed, but where and why it failed, how much it cost, and how consistent it was, helping you understand and improve your agent's real-world reliability.

Available on PyPI.

Use this if you need to rigorously evaluate the reliability, consistency, cost, and latency of your AI agents, especially when dealing with the inherent variability of large language models, and want clear statistical insights into their performance.

Not ideal if you only need a quick, single-run check of an agent's functionality or if you're not working with AI agents that exhibit variable behavior across runs.
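
To illustrate the statistical idea behind multi-run evaluation (this is a minimal generic sketch, not agentrial's own API, which is not shown on this page): a pass rate estimated from repeated runs can be reported with a confidence interval, for example a Wilson score interval, instead of a single pass/fail anecdote.

import math

def wilson_interval(successes, trials, z=1.96):
    # 95% Wilson score interval for a pass rate estimated from repeated runs.
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (max(0.0, center - margin), min(1.0, center + margin))

# Hypothetical example: an agent passes 17 of 20 repeated runs of the same task.
low, high = wilson_interval(17, 20)
print(f"pass rate {17 / 20:.2f}, 95% CI [{low:.2f}, {high:.2f}]")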

AI-agent-development LLM-evaluation model-reliability AI-quality-assurance agent-performance-testing
Maintenance 10 / 25
Adoption 6 / 25
Maturity 20 / 25
Community 10 / 25
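
(The four sub-scores above appear to sum to the overall score: 10 + 6 + 20 + 10 = 46 out of a possible 100.)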


Stars: 15
Forks: 2
Language: Python
License: MIT
Last pushed: Feb 06, 2026
Commits (30d): 0
Dependencies: 7

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/agents/alepot55/agentrial"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
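
If you prefer Python over curl, the same endpoint can be queried with the requests package. This is a sketch under the assumption that the response body is JSON; the exact schema is not documented on this page.

import requests  # assumes the requests package is installed

# Same endpoint as the curl example above.
url = "https://pt-edge.onrender.com/api/v1/quality/agents/alepot55/agentrial"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
data = resp.json()
# Pretty-print whatever fields come back (assumes a JSON object at the top level).
for key, value in data.items():
    print(f"{key}: {value}")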