ai-evaluation and agentrial
These are **competitors**: both provide evaluation frameworks for AI systems, but agentrial focuses on rigorous statistical testing of agent behavior while ai-evaluation positions itself as a broader platform for evaluating LLM workflows, so users would likely choose based on whether they prioritize statistical rigor or evaluation breadth.
About ai-evaluation
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
This framework helps AI product managers and developers assess, monitor, and safeguard their large language model (LLM) applications. It takes your LLM's outputs, context, and user inputs and produces scores with explanations across 50+ metrics such as faithfulness, toxicity, and relevancy, so you can verify that your AI behaves as expected and adheres to safety standards.
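As a rough sketch of that inputs-in, scores-and-explanations-out shape, the toy Python below grades one response on a crude word-overlap "faithfulness" proxy. This is not ai-evaluation's actual API, and the framework's real metrics are presumably far more sophisticated; every name here is illustrative only.

```python
# Toy sketch only: mimics the (user input, context, output) -> (score, explanation)
# shape described above. NOT ai-evaluation's real API or metric logic.
import re

def words(text: str) -> set[str]:
    """Lowercased word set with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def faithfulness(output: str, context: str) -> tuple[float, str]:
    """Crude proxy: fraction of output words that also appear in the context."""
    out = words(output)
    if not out:
        return 0.0, "empty output"
    grounded = len(out & words(context))
    return grounded / len(out), f"{grounded}/{len(out)} output words grounded in context"

context = "The Eiffel Tower is 330 metres tall and stands in Paris."
output = "The Eiffel Tower is 330 metres tall."
score, why = faithfulness(output, context)
print(f"faithfulness={score:.2f} ({why})")  # faithfulness=1.00 (7/7 output words grounded in context)
```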
About agentrial
alepot55/agentrial
Statistical evaluation framework for AI agents
When you build or deploy AI agents powered by large language models, it is frustrating when they work perfectly one day and fail the next. This project helps you test agents reliably by running them many times on specific tasks and reporting statistical confidence intervals for performance instead of single-run anecdotes. It tells you not just whether an agent passed or failed, but where and why it failed, how much each run cost, and how consistent the results were, helping developers understand and improve their agents' real-world reliability.
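The statistical core of that approach, many runs summarized by an interval rather than one anecdote, fits in a few lines of plain Python. This sketch does not use agentrial's actual API; `flaky_agent` is a hypothetical stand-in for a real agent run, and the Wilson score interval is one standard choice of binomial confidence interval.

```python
import math
import random

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial pass rate."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return max(0.0, center - half), min(1.0, center + half)

def flaky_agent(task: str) -> bool:
    """Hypothetical stand-in for one agent run; passes ~70% of the time."""
    return random.random() < 0.7

n = 50
passes = sum(flaky_agent("book a flight") for _ in range(n))
lo, hi = wilson_interval(passes, n)
print(f"pass rate {passes}/{n} = {passes/n:.0%}, 95% CI [{lo:.0%}, {hi:.0%}]")
```

Even with 50 runs the interval stays wide (roughly ±12 points around a 70% pass rate), which is exactly why single-run results are anecdotes rather than evidence.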
Scores updated daily from GitHub, PyPI, and npm data.