Agent Evaluation & Benchmarking for AI Agents

Frameworks, platforms, and harnesses for systematically testing, benchmarking, and evaluating autonomous agent performance across capabilities like tool-use, reasoning, cost-efficiency, and safety. Does NOT include agent building frameworks, deployment infrastructure, or multi-agent competition environments designed primarily for training rather than evaluation.

There are 149 agent evaluation and benchmarking projects tracked. 7 score 50 or higher (Established tier). The highest-rated is StonyBrookNLP/appworld at 65/100 with 388 stars. Only 1 of the top 10 is actively maintained.

Get all 149 projects as JSON (the example request below caps the response at 20 results via limit=20):

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=agents&subcategory=agent-evaluation-benchmarking&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
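For scripted access, here is a minimal Python sketch built around the same endpoint and query parameters as the curl example above. The response-parsing side is an assumption: the field names (`items`, `name`, `score`, `tier`) are placeholders for whatever the JSON payload actually contains and may need adjusting.

```python
# Minimal sketch for pulling this dataset in Python.
# The endpoint and query parameters mirror the curl example above; the JSON
# field names ("items", "name", "score", "tier") are assumptions about the
# response shape and should be adapted to the actual payload.
import requests

URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "agents",
    "subcategory": "agent-evaluation-benchmarking",
    "limit": 20,  # raise or paginate if the API permits, to cover all 149 projects
}

resp = requests.get(URL, params=params, timeout=30)
resp.raise_for_status()
data = resp.json()

# Accept either a bare list of records or a list nested under an assumed "items" key.
projects = data["items"] if isinstance(data, dict) and "items" in data else data

for p in projects:
    if isinstance(p, dict):
        print(p.get("name", "?"), p.get("score", "?"), p.get("tier", "?"))
```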

| # | Agent | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | StonyBrookNLP/appworld | 🌍 AppWorld: A Controllable World of Apps and People for Benchmarking... | 65 | Established |
| 2 | qualifire-dev/rogue | AI Agent Evaluator & Red Team Platform | 61 | Established |
| 3 | microsoft/WindowsAgentArena | Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and... | 56 | Established |
| 4 | future-agi/ai-evaluation | Evaluation Framework for all your AI related Workflows | 54 | Established |
| 5 | RouteWorks/RouterArena | RouterArena: An open framework for evaluating LLM routers with standardized... | 50 | Established |
| 6 | dreadnode/AIRTBench-Code | Code Repository for: AIRTBench: Measuring Autonomous AI Red Teaming... | 50 | Established |
| 7 | agentscope-ai/OpenJudge | OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards | 50 | Established |
| 8 | steel-dev/leaderboard | Open leaderboard for browser agents | 49 | Emerging |
| 9 | hidai25/eval-view | Regression testing for AI agents. Snapshot behavior, diff tool calls, catch... | 49 | Emerging |
| 10 | SWE-bench/swe-bench.github.io | Landing page + leaderboard for SWE-Bench benchmark | 49 | Emerging |
| 11 | SparkBeyond/agentune | Tune your AI Agent to best meet its KPI with a cyclic process of analyze,... | 49 | Emerging |
| 12 | ag2ai/Agents_Failure_Attribution | Benchmark for automated failure attributions in agentic systems (🏆 ICML 2025... | 48 | Emerging |
| 13 | Farama-Foundation/chatarena | ChatArena (or Chat Arena) is a Multi-Agent Language Game Environments for... | 48 | Emerging |
| 14 | rungalileo/agent-leaderboard | Ranking LLMs on agentic tasks | 47 | Emerging |
| 15 | alepot55/agentrial | Statistical evaluation framework for AI agents | 46 | Emerging |
| 16 | SAILResearch/awesome-foundation-model-leaderboards | A curated list of awesome leaderboard-oriented resources for AI domain | 45 | Emerging |
| 17 | ltzheng/agent-studio | [ICLR 2025] A trinity of environments, tools, and benchmarks for general... | 45 | Emerging |
| 18 | biological-alignment-benchmarks/biological-alignment-gridagents-benchmarks | Safety challenges for RL and LLM agents' ability to learn and use... | 45 | Emerging |
| 19 | Cognitive-AI-Systems/pogema-benchmark | This is an umbrella repository that contains links and information about all... | 44 | Emerging |
| 20 | itbench-hub/ITBench-Scenarios | ⚠️ ARCHIVED - All development moved to... | 42 | Emerging |
| 21 | geval-labs/geval | Eval-driven release gates for AI applications | 42 | Emerging |
| 22 | justindobbs/Tracecore | Deterministic runtime for agent evaluation | 40 | Emerging |
| 23 | 8monkey-ai/hebo-evals | Markdown for Evals, a human-first format | 40 | Emerging |
| 24 | AISmithLab/HumanStudy-Bench | HumanStudy-Bench: Towards AI Agent Design for Participant Simulation | 40 | Emerging |
| 25 | StonyBrookNLP/appworld-leaderboard | 🌍 Leaderboard Repository for "AppWorld: A Controllable World of Apps and... | 39 | Emerging |
| 26 | Vexp-ai/vexp-swe-bench | Open benchmark for AI coding agents on SWE-bench Verified. Compare... | 38 | Emerging |
| 27 | HumanStudy-Hub/HumanStudy-Bench | HumanStudy-Bench: Community Edition — Standardized human study replays for... | 38 | Emerging |
| 28 | jackjin1997/AgentBench-Live | The open benchmark for AI agent task execution. Claude Code vs Gemini CLI —... | 37 | Emerging |
| 29 | shubchat/loab | LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending... | 37 | Emerging |
| 30 | laiso/ts-bench | Measure and compare the performance of AI coding agents on TypeScript tasks. | 36 | Emerging |
| 31 | Privatris/AgentLeak | AgentLeak: Open benchmark for privacy leakage in LLM agents — 7 channels,... | 36 | Emerging |
| 32 | stchakwdev/Secret_H_Evals | Multi-agent strategic deception evaluation framework for LLMs using Secret... | 36 | Emerging |
| 33 | wallezhang/agent-eval | A YAML-config-driven CLI tool for evaluating AI agents | 35 | Emerging |
| 34 | CosmosYi/AutoControl-Arena | 🛡️ AutoControl Arena: Synthesizing Executable Test Environments for Frontier... | 34 | Emerging |
| 35 | BUAA-CLab/CircuitMind | The code about TC-Bench and CircuitMind | 34 | Emerging |
| 36 | LeoYeAI/myclaw-bench | The definitive benchmark for AI agents on OpenClaw. 45 tasks across 4 tiers.... | 34 | Emerging |
| 37 | OpenSymbolicAI/benchmark-py-legalbench | LegalBench benchmark: GoalSeeking agent for 162 legal reasoning tasks | 34 | Emerging |
| 38 | vectorize-io/agent-memory-benchmark | Agent Memory Benchmark | 33 | Emerging |
| 39 | lechmazur/elimination_game | A multi-player tournament benchmark that tests LLMs in social reasoning,... | 33 | Emerging |
| 40 | future-agi/futureagi-sdk | Production-grade AI evaluation, prompt management & observability SDK.... | 33 | Emerging |
| 41 | elliot736/modelab | Open-source A/B testing framework for LLM systems with deterministic... | 32 | Emerging |
| 42 | Icarus603/tech-innovation-eval-agent | Enterprise technology innovation capability evaluation agent | 32 | Emerging |
| 43 | plaited/agent-eval-harness | Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters... | 32 | Emerging |
| 44 | yjyddq/RiOSWorld | [NeurIPS 2025] Official repository of RiOSWorld: Benchmarking the Risk of... | 32 | Emerging |
| 45 | madhavkrishangarg/ReviewEval | ReviewEval: An Evaluation Framework for AI-Generated Reviews | 32 | Emerging |
| 46 | campfirein/brv-bench | Benchmark suite for evaluating retrieval quality and latency of AI agent... | 30 | Emerging |
| 47 | 4xxpray/ai-eval | 🤖 Evaluate and optimize LLM prompts with multi-provider support, rich... | 30 | Emerging |
| 48 | yotambraun/Toolscore | Python framework for evaluating LLM tool-calling behavior with comprehensive... | 29 | Experimental |
| 49 | nottelabs/open-operator-evals | Opensource benchmark evaluating web operators/agents performance | 29 | Experimental |
| 50 | The-Swarm-Corporation/ModelArena | ModelArena: A Competitive Environment for Multi-Agent Training | 29 | Experimental |
| 51 | lechmazur/step_game | Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception... | 27 | Experimental |
| 52 | mlbio-epfl/HeurekaBench | [ICLR 2026] A framework to "create benchmarks" and "evaluate AI... | 27 | Experimental |
| 53 | melchiorhering/GUI-OS-AI-Agent-Benchmarking | A modular framework for benchmarking multimodal AI agents in a reproducible,... | 27 | Experimental |
| 54 | Parslee-ai/statebench | Conformance test for stateful AI agents. Measures state correctness over time. | 26 | Experimental |
| 55 | sstklen/washin-api-benchmark | From Benchmarks to Architecture — We tested 30+ AI APIs, designed routing... | 26 | Experimental |
| 56 | Terminus-Lab/themis | LLM evaluation service with validated judges. Multi-dimensional scoring... | 26 | Experimental |
| 57 | AnLuo1/Assisted-DS | This is the official page of the paper "AssistedDS: Benchmarking How... | 25 | Experimental |
| 58 | Syncause/syncause-benchmark | AI-driven RCA benchmark evaluating Syncause's accuracy, interpretability,... | 24 | Experimental |
| 59 | The-Swarm-Corporation/Xray-Bench | XRayBench is a state-of-the-art evaluation platform designed specifically... | 24 | Experimental |
| 60 | xyva-yuangui/smartness-eval | 🎯 12-Dimension AI Agent Intelligence Assessment (automated 12-dimension AI agent intelligence evaluation skill)... | 24 | Experimental |
| 61 | nagu-io/agent-settlement-bench | Benchmark for evaluating safety of AI agents in irreversible financial... | 24 | Experimental |
| 62 | IlyasFardaouix/Agent-racing-league | The world's first racing league for AI agents. Think F1, but the drivers are AI. | 23 | Experimental |
| 63 | pauldebdeep9/awesome-agentic-evaluation | A curated list of benchmarks, environments, papers, and tooling for agentic... | 23 | Experimental |
| 64 | yazcaleb/can-is-not-may | Authority Models for Governable AI Agents — paper, AuthorityBench (54... | 23 | Experimental |
| 65 | DUBSOpenHub/shadow-score-spec | A framework-agnostic metric for measuring AI code generation quality.... | 23 | Experimental |
| 66 | clouatre-labs/llm-agent-experiments | Benchmarking open-weight LLM coding agents as SCOUT delegates: model... | 23 | Experimental |
| 67 | dataanswer/awesome-agent-benchmarks | A curated collection of the world's most advanced benchmark datasets for... | 23 | Experimental |
| 68 | azurefr/AI-Arena---Benchmarking-Platform-for-Autonomous-AI-Agents | Benchmark autonomous AI agents by measuring their reasoning and competitive... | 22 | Experimental |
| 69 | joshualamerton/agent-evaluation-lab | Sandbox platform for testing and evaluating autonomous agents | 22 | Experimental |
| 70 | osheryadgar/tendedloop-arena | Python SDK for TendedLoop Arena — multi-agent gamification research... | 22 | Experimental |
| 71 | Ethandata/crucible-sim | Crucible — The Economic Autonomy Standard. Stress-test AI agents under... | 22 | Experimental |
| 72 | MukundaKatta/AgentBench | Agent evaluation and benchmarking suite — accuracy, efficiency, and tool... | 22 | Experimental |
| 73 | Vinashu/razor-cascade | Framework to benchmark same-provider LLM cascading and measure API cost,... | 22 | Experimental |
| 74 | choutos/agent-eval-framework | Lightweight, practical evaluation framework for AI agents in production.... | 22 | Experimental |
| 75 | dario-github/agent-self-evolution | Automated evaluation, ablation testing, and continuous improvement framework... | 22 | Experimental |
| 76 | ristponex/awesome-minimax-m2.7 | 🧠 Awesome MiniMax M2.7 — Self-evolving coding AI. Integrations, benchmarks,... | 22 | Experimental |
| 77 | davidgracemann/statma | stat-my-agent; benchmark consistency, tool-use, failure-recovery and... | 22 | Experimental |
| 78 | evan66547/Contract-Reviewer-Agent-Eval | ⚖️ Benchmark evaluation framework for AI-powered legal contract review... | 22 | Experimental |
| 79 | dairongzhen3-creator/illusion-of-emergence | Why your multi-agent LLM deception experiment might be measuring prompt... | 22 | Experimental |
| 80 | widingmarcus-cyber/opengym | 240 challenges to test if your AI agent actually works — not just the model,... | 22 | Experimental |
| 81 | diorwave/agent-playground | A minimal sandbox to run, score, and compare AI agent outputs locally. | 22 | Experimental |
| 82 | alexmar07/agent-arena | A self-regulating arena where AI agents compete for work through sealed-bid auctions | 22 | Experimental |
| 83 | yiyangzhang-ai/open-agent-eval | Lightweight open-source toolkit for evaluating tool-calling AI agents on... | 22 | Experimental |
| 84 | AaronZhou-THU/agent-eval-workbench | A practical workbench for prompt, model, and mocked workflow evaluation with... | 22 | Experimental |
| 85 | tsanthoshreddy/agent-qa-lab | Trace-aware regression harness for tool-using Strands agents with... | 22 | Experimental |
| 86 | someonehereexists/AI-Arena---Benchmarking-Platform-for-Autonomous-AI-Agents | AI Arena is a competitive evaluation framework where multiple AI agents... | 22 | Experimental |
| 87 | BayramAnnakov/eval-coach | Agent Skill for Evaluation-Driven Development (EDD) - guide AI evaluation... | 22 | Experimental |
| 88 | justindobbs/awesome-certified-agents | A community catalog of autonomous agents and bundles certified by passing... | 22 | Experimental |
| 89 | eliumusk/agentreflect | AI agent self-reflection & self-evaluation tool. Built by an AI, for AIs. | 21 | Experimental |
| 90 | kadubon/search-stability-lab | Theory-to-experiment lab for search stability in long-running agents under... | 21 | Experimental |
| 91 | FishIntelGlobal/uncertainty-axioms | Computational validation suite for The First Principles of Uncertainty... | 21 | Experimental |
| 92 | dikatwoone/FluxCodeBench | 🔍 Evaluate LLM agents on multi-phase programming tasks with FluxCodeBench,... | 21 | Experimental |
| 93 | thisisyoussef/ghostfolio-agent-eval-dataset | Deterministic golden eval dataset for finance-domain agent testing... | 21 | Experimental |
| 94 | deathlabs/sunshower | Declarative and Distributed Benchmarking for AI Agents | 21 | Experimental |
| 95 | digital-rain-tech/ara-eval | ARA-Eval: Agentic Readiness Assessment — evaluation framework for... | 21 | Experimental |
| 96 | axxafo/awesome-agent-benchmarks | 🧠 Discover and evaluate advanced benchmark datasets for Large Language Model... | 21 | Experimental |
| 97 | mireya001/evalops-kit | CI-native evals for tool-using agents: datasets, traces, deterministic... | 21 | Experimental |
| 98 | robobobby/agenteval | Behavior test framework for AI agents. Define tests in YAML. Run against... | 21 | Experimental |
| 99 | SainathPattipati/agent-evaluation-harness | Framework to benchmark and evaluate multi-agent system performance,... | 21 | Experimental |
| 100 | HomenShum/nodebench-boilerplate | Production-ready boilerplate for AI agent projects using NodeBench MCP. 129... | 21 | Experimental |
| 101 | fraction12/open-rank | The open benchmark for AI agents — daily puzzles, public rankings | 21 | Experimental |
| 102 | 1sdeb/sidemind.ai | AI Assurance Metrics Analyzer - Evaluate LLM outputs with 15 quality... | 21 | Experimental |
| 103 | greynewell/swe-bench-pro-action | GitHub Action for SWE-bench Pro evaluation powered by mcpbr | 21 | Experimental |
| 104 | jstilb/meaningful_metrics | Open-source evaluation frameworks for human-centered metrics, AI evaluation... | 21 | Experimental |
| 105 | leaderboard-md/spec | LEADERBOARD.md — Open standard for AI agent performance benchmarking. Track... | 21 | Experimental |
| 106 | akshan-main/equitas-benchmark | Corruption-robustness benchmark for hierarchical multi-LLM committees | 21 | Experimental |
| 107 | GZQKCHQM/M_bench | Measure Apple Silicon performance for Python and NumPy workloads, providing... | 21 | Experimental |
| 108 | Jesutofunmie/Haiku-4.5-vs-Minimax-2.1 | 🧠 Benchmark Haiku 4.5 and MiniMax M2.1 on agentic tasks, revealing strengths... | 21 | Experimental |
| 109 | lechmazur/pgg_bench | Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent... | 21 | Experimental |
| 110 | jstilb/llm-eval-framework | LLM evaluation framework with custom metrics, LLM-as-judge, and... | 21 | Experimental |
| 111 | pyros-projects/agent-comparison | Qualitative benchmark suite for evaluating AI coding agents and... | 21 | Experimental |
| 112 | zahere/stochastic-circuit-breaker | Statistically optimal circuit breaker for stochastic systems. 4-state... | 21 | Experimental |
| 113 | Pashasan/llm_price_sensitivity_evaluation | Conjoint experiment measuring price sensitivity and economic preferences of... | 21 | Experimental |
| 114 | speed785/evalforge | Agent Evaluation Harness — write repeatable, measurable evals for AI agents.... | 21 | Experimental |
| 115 | NeoSkillFactory/llm-benchmark | Automatically benchmarks LLM responses across multiple models using... | 21 | Experimental |
| 116 | datalayer-challenges/dabench-leaderboard | 🤖 A2A-compatible DABench evaluation leaderboard with AgentBeats architecture. | 20 | Experimental |
| 117 | BAAI-Agents/SWITCH | SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in... | 19 | Experimental |
| 118 | ian-flores/securebench | Evaluation and benchmarking framework for R LLM agents | 19 | Experimental |
| 119 | jonradoff/hiddenbench | HiddenBench: Benchmark for evaluating collective reasoning in multi-agent LLM systems | 19 | Experimental |
| 120 | manishklach/agentic_cpu_bottleneck_bench | Vendor-neutral simulator + benchmark for agent runtime overhead: fan-out,... | 19 | Experimental |
| 121 | vvsotnikov/astro-bench | Can AI agents do real science? Benchmarking AI agents on KASCADE cosmic ray... | 19 | Experimental |
| 122 | Emersoft76/ai-agent-systems-advanced-benchmarking | Modular AI agent system with LLMs, tools, and benchmark optimization | 18 | Experimental |
| 123 | tostechbr/evoloop | Framework-agnostic eval toolkit for AI agents — capture traces, judge... | 17 | Experimental |
| 124 | josephsenior/agent-evaluation-platform | 🚀 Professional-grade AI Agent Evaluation Platform. Multi-provider... | 17 | Experimental |
| 125 | vectorize-io/hindsight-benchmarks | Hindsight Benchmarks Results | 17 | Experimental |
| 126 | messeb/py-deepeval-behave-bdd-testing-example | An example that combines Behave (BDD testing) with DeepEval (LLM evaluation)... | 15 | Experimental |
| 127 | memvid/memvidbench | Benchmark tool for evaluating Memvid on the LoCoMo (Long-term Conversational... | 14 | Experimental |
| 128 | patrikmarshall/opencode-benchmark-dashboard | Measure and compare speed and accuracy of large language models using... | 14 | Experimental |
| 129 | graciegould/agent-performance-tests | Benchmarks how codebase structure affects AI agent efficiency — tool calls,... | 14 | Experimental |
| 130 | crabsatellite/lem-experiments | Reproducible experiments for: LLM Exposure Monitoring — A Security Framework... | 14 | Experimental |
| 131 | Lap-Platform/Lap-benchmark-docs | LAP benchmark results — 500 runs, 50 specs, 5 formats. Agents run 35%... | 14 | Experimental |
| 132 | Software-Engineering-Arena/SWE-Agent-Arena | Compare agents pairwise via multi-round evaluations for SE tasks. | 14 | Experimental |
| 133 | Red1-Rahman/Prompt-Eval | Streamlit prompt evaluation tool that auto-generates test cases, runs evals,... | 14 | Experimental |
| 134 | jamjet-labs/jamjet-benchmarks | JamJet benchmarks, migration guides, and feature comparisons vs LangGraph,... | 14 | Experimental |
| 135 | Ritvik777/Galileo_Project | Galileo: Observations and Evals | 14 | Experimental |
| 136 | avdolgikh/poker-coach-eval-harness | LLM-powered evaluation harness for detecting orchestration failures in AI... | 14 | Experimental |
| 137 | memstate-ai/memstate-benchmark | Open-source benchmark for AI agent memory systems — compare Memstate, mem0,... | 14 | Experimental |
| 138 | prajaktapandit7/conversational-AI-evaluation | Structured evaluation of 30 support bot conversations measuring containment,... | 13 | Experimental |
| 139 | brianjmarvin/datasnack-ai | The DataSnack AI Agent Evaluator is a CLI tool that automates the testing of... | 13 | Experimental |
| 140 | corradocavalli/agentic_evaluation | Demonstration of testing and evaluation patterns for AI agents using Azure... | 13 | Experimental |
| 141 | lintware/AI_Agent_Frameworks_Comparison | Benchmark comparing 8 AI agent frameworks (SmolAgents, OpenAI Agents SDK,... | 13 | Experimental |
| 142 | ImSudhakar07/RivalReview-Evals | An eval platform that continuously monitors the quality of the /RivalReview... | 13 | Experimental |
| 143 | yzotop/ab-factory-demo | Deterministic multi-agent A/B test evaluation system with policy engine,... | 13 | Experimental |
| 144 | Jojodicus/ai-identity-benchmark | Does the identity in a system prompt change performance? | 13 | Experimental |
| 145 | EmZod/Earth-Magnetic-Field-Research-Minimax-w-subagents-in-pi- | Multi-agent research orchestration using MiniMax-M2.1 with thinking enabled.... | 13 | Experimental |
| 146 | codedbyelif/els-judge | Multi-LLM consensus engine for automated code review, diff analysis, and... | 13 | Experimental |
| 147 | EmZod/Haiku-4.5-vs-Minimax-2.1 | Systematic benchmark comparing Claude Haiku 4.5 vs MiniMax M2.1 on agentic... | 13 | Experimental |
| 148 | mohsinsheikhani/support-fte-evals | Eval-driven Customer Support FTE using OpenAI Agents SDK. Multi-agent... | 11 | Experimental |
| 149 | rithirs/Autoeval | Automated evaluation framework using multi-agent backend with FastAPI for... | 10 | Experimental |

Comparisons in this category