ai-evaluation and agentrial

These are **competitors**: both provide evaluation frameworks for AI agents. agentrial focuses on rigorous statistical testing of agent behavior, while ai-evaluation positions itself as a broader workflow evaluation platform, so users would likely choose based on whether they prioritize statistical rigor or evaluation breadth.

| | ai-evaluation | agentrial |
| --- | --- | --- |
| Overall score | 54 (Established) | 46 (Emerging) |
| Maintenance | 10/25 | 10/25 |
| Adoption | 9/25 | 6/25 |
| Maturity | 15/25 | 20/25 |
| Community | 20/25 | 10/25 |
| Stars | 84 | 15 |
| Forks | 29 | 2 |
| Downloads | | |
| Commits (30d) | 0 | 0 |
| Language | Python | Python |
| License | GPL-3.0 | MIT |

No package · No dependents · No risk flags

About ai-evaluation

future-agi/ai-evaluation

Evaluation Framework for all your AI related Workflows

This framework helps AI product managers and developers assess, monitor, and guard their Large Language Model (LLM) applications. It takes your LLM's outputs, context, and user inputs to produce scores and explanations across 50+ metrics like faithfulness, toxicity, and relevancy. You can use it to ensure your AI behaves as expected and adheres to safety standards.
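The described input/output shape (user input + context + LLM output in, per-metric scores and explanations out) can be sketched with a toy evaluator. This is an illustration only: the function below is hypothetical and is not the real future-agi/ai-evaluation API, and the two lexical-overlap "metrics" are crude stand-ins for the framework's 50+ metrics such as faithfulness and relevancy.

```python
# Hypothetical sketch, not the ai-evaluation API: mimics the described
# (user input, context, LLM output) -> per-metric scores + explanations shape.

def evaluate_response(user_input: str, context: str, output: str) -> dict:
    """Toy stand-in for a metric suite such as faithfulness and relevancy."""
    context_terms = set(context.lower().split())
    output_terms = set(output.lower().split())
    query_terms = set(user_input.lower().split())

    # Faithfulness proxy: how much of the output is grounded in the context.
    faithfulness = len(output_terms & context_terms) / max(len(output_terms), 1)
    # Relevancy proxy: how much of the query vocabulary the output addresses.
    relevancy = len(output_terms & query_terms) / max(len(query_terms), 1)

    return {
        "faithfulness": {"score": round(faithfulness, 2),
                         "explanation": "share of output terms found in context"},
        "relevancy": {"score": round(relevancy, 2),
                      "explanation": "share of query terms covered by output"},
    }

result = evaluate_response(
    user_input="what license does the project use",
    context="the project is released under the MIT license",
    output="the project uses the MIT license",
)
print(result["faithfulness"]["score"], result["relevancy"]["score"])  # 0.8 0.5
```

A real metric would use an LLM or trained classifier rather than word overlap, but the contract is the same: every metric returns both a score and an explanation so results are auditable.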

LLM-operations AI-safety prompt-engineering chatbot-testing model-governance

About agentrial

alepot55/agentrial

Statistical evaluation framework for AI agents

AI agents built on large language models can work perfectly one day and fail the next. This project helps you test them reliably by running each agent multiple times on specific tasks and reporting statistical confidence intervals for performance instead of single-run anecdotes. It tells you not just whether an agent passed or failed, but where and why it failed, how much it cost, and how consistent it was, helping developers understand and improve their agents' real-world reliability.

AI-agent-development LLM-evaluation model-reliability AI-quality-assurance agent-performance-testing

Scores updated daily from GitHub, PyPI, and npm data.