Agent Evaluation & Benchmarking for AI Agents

Frameworks, platforms, and harnesses for systematically testing, benchmarking, and evaluating autonomous agent performance across capabilities like tool-use, reasoning, cost-efficiency, and safety. Does NOT include agent building frameworks, deployment infrastructure, or multi-agent competition environments designed primarily for training rather than evaluation.

There are 149 agent evaluation and benchmarking projects tracked. 7 score 50 or higher (Established tier). The highest-rated is StonyBrookNLP/appworld at 65/100 with 388 stars. Only 1 of the top 10 is actively maintained.

Get all 149 projects as JSON (the example request below caps the response at 20 results via limit=20):

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=agents&subcategory=agent-evaluation-benchmarking&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
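For scripted access, here is a minimal Python sketch built around the same endpoint and query parameters as the curl example above. The response-parsing side is an assumption: the field names (`items`, `name`, `score`, `tier`) are placeholders for whatever the JSON payload actually contains and may need adjusting.

```python
# Minimal sketch for pulling this dataset in Python.
# The endpoint and query parameters mirror the curl example above; the JSON
# field names ("items", "name", "score", "tier") are assumptions about the
# response shape and should be adapted to the actual payload.
import requests

URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "agents",
    "subcategory": "agent-evaluation-benchmarking",
    "limit": 20,  # raise or paginate if the API permits, to cover all 149 projects
}

resp = requests.get(URL, params=params, timeout=30)
resp.raise_for_status()
data = resp.json()

# Accept either a bare list of records or a list nested under an assumed "items" key.
projects = data["items"] if isinstance(data, dict) and "items" in data else data

for p in projects:
    if isinstance(p, dict):
        print(p.get("name", "?"), p.get("score", "?"), p.get("tier", "?"))
```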

| # | Agent | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | StonyBrookNLP/appworld | 🌍 AppWorld: A Controllable World of Apps and People for Benchmarking... | 65 | Established |
| 2 | qualifire-dev/rogue | AI Agent Evaluator & Red Team Platform | 61 | Established |
| 3 | microsoft/WindowsAgentArena | Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and... | 56 | Established |
| 4 | future-agi/ai-evaluation | Evaluation Framework for all your AI related Workflows | 54 | Established |
| 5 | RouteWorks/RouterArena | RouterArena: An open framework for evaluating LLM routers with standardized... | 50 | Established |
| 6 | dreadnode/AIRTBench-Code | Code Repository for: AIRTBench: Measuring Autonomous AI Red Teaming... | 50 | Established |
| 7 | agentscope-ai/OpenJudge | OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards | 50 | Established |
| 8 | steel-dev/leaderboard | Open leaderboard for browser agents | 49 | Emerging |
| 9 | hidai25/eval-view | Regression testing for AI agents. Snapshot behavior, diff tool calls, catch... | 49 | Emerging |
| 10 | SWE-bench/swe-bench.github.io | Landing page + leaderboard for SWE-Bench benchmark | 49 | Emerging |
| 11 | SparkBeyond/agentune | Tune your AI Agent to best meet its KPI with a cyclic process of analyze,... | 49 | Emerging |
| 12 | ag2ai/Agents_Failure_Attribution | Benchmark for automated failure attributions in agentic systems (🏆 ICML 2025... | 48 | Emerging |
| 13 | Farama-Foundation/chatarena | ChatArena (or Chat Arena) is a Multi-Agent Language Game Environments for... | 48 | Emerging |
| 14 | rungalileo/agent-leaderboard | Ranking LLMs on agentic tasks | 47 | Emerging |
| 15 | alepot55/agentrial | Statistical evaluation framework for AI agents | 46 | Emerging |
| 16 | SAILResearch/awesome-foundation-model-leaderboards | A curated list of awesome leaderboard-oriented resources for AI domain | 45 | Emerging |
| 17 | ltzheng/agent-studio | [ICLR 2025] A trinity of environments, tools, and benchmarks for general... | 45 | Emerging |
| 18 | biological-alignment-benchmarks/biological-alignment-gridagents-benchmarks | Safety challenges for RL and LLM agents' ability to learn and use... | 45 | Emerging |
| 19 | Cognitive-AI-Systems/pogema-benchmark | This is an umbrella repository that contains links and information about all... | 44 | Emerging |
| 20 | itbench-hub/ITBench-Scenarios | ⚠️ ARCHIVED - All development moved to... | 42 | Emerging |
| 21 | geval-labs/geval | Eval-driven release gates for AI applications | 42 | Emerging |
| 22 | justindobbs/Tracecore | Deterministic runtime for agent evaluation | 40 | Emerging |
| 23 | 8monkey-ai/hebo-evals | Markdown for Evals, a human-first format | 40 | Emerging |
| 24 | AISmithLab/HumanStudy-Bench | HumanStudy-Bench: Towards AI Agent Design for Participant Simulation | 40 | Emerging |
| 25 | StonyBrookNLP/appworld-leaderboard | 🌍 Leaderboard Repository for "AppWorld: A Controllable World of Apps and... | 39 | Emerging |
| 26 | Vexp-ai/vexp-swe-bench | Open benchmark for AI coding agents on SWE-bench Verified. Compare... | 38 | Emerging |
| 27 | HumanStudy-Hub/HumanStudy-Bench | HumanStudy-Bench: Community Edition — Standardized human study replays for... | 38 | Emerging |
| 28 | jackjin1997/AgentBench-Live | The open benchmark for AI agent task execution. Claude Code vs Gemini CLI —... | 37 | Emerging |
| 29 | shubchat/loab | LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending... | 37 | Emerging |
| 30 | laiso/ts-bench | Measure and compare the performance of AI coding agents on TypeScript tasks. | 36 | Emerging |
| 31 | Privatris/AgentLeak | AgentLeak: Open benchmark for privacy leakage in LLM agents — 7 channels,... | 36 | Emerging |
| 32 | stchakwdev/Secret_H_Evals | Multi-agent strategic deception evaluation framework for LLMs using Secret... | 36 | Emerging |
| 33 | wallezhang/agent-eval | A YAML-config-driven CLI tool for evaluating AI agents | 35 | Emerging |
| 34 | CosmosYi/AutoControl-Arena | 🛡️ AutoControl Arena: Synthesizing Executable Test Environments for Frontier... | 34 | Emerging |
| 35 | BUAA-CLab/CircuitMind | The code about TC-Bench and CircuitMind | 34 | Emerging |
| 36 | LeoYeAI/myclaw-bench | The definitive benchmark for AI agents on OpenClaw. 45 tasks across 4 tiers.... | 34 | Emerging |
| 37 | OpenSymbolicAI/benchmark-py-legalbench | LegalBench benchmark: GoalSeeking agent for 162 legal reasoning tasks | 34 | Emerging |
| 38 | vectorize-io/agent-memory-benchmark | Agent Memory Benchmark | 33 | Emerging |
| 39 | lechmazur/elimination_game | A multi-player tournament benchmark that tests LLMs in social reasoning,... | 33 | Emerging |
| 40 | future-agi/futureagi-sdk | Production-grade AI evaluation, prompt management & observability SDK.... | 33 | Emerging |
| 41 | elliot736/modelab | Open-source A/B testing framework for LLM systems with deterministic... | 32 | Emerging |
| 42 | Icarus603/tech-innovation-eval-agent | Enterprise technology innovation capability evaluation agent | 32 | Emerging |
| 43 | plaited/agent-eval-harness | Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters... | 32 | Emerging |
| 44 | yjyddq/RiOSWorld | [NeurIPS 2025] Official repository of RiOSWorld: Benchmarking the Risk of... | 32 | Emerging |
| 45 | madhavkrishangarg/ReviewEval | ReviewEval: An Evaluation Framework for AI-Generated Reviews | 32 | Emerging |
| 46 | campfirein/brv-bench | Benchmark suite for evaluating retrieval quality and latency of AI agent... | 30 | Emerging |
| 47 | 4xxpray/ai-eval | 🤖 Evaluate and optimize LLM prompts with multi-provider support, rich... | 30 | Emerging |
| 48 | yotambraun/Toolscore | Python framework for evaluating LLM tool-calling behavior with comprehensive... | 29 | Experimental |
| 49 | nottelabs/open-operator-evals | Opensource benchmark evaluating web operators/agents performance | 29 | Experimental |
| 50 | The-Swarm-Corporation/ModelArena | ModelArena: A Competitive Environment for Multi-Agent Training | 29 | Experimental |
| 51 | lechmazur/step_game | Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception... | 27 | Experimental |
| 52 | mlbio-epfl/HeurekaBench | [ICLR 2026] A framework to "create benchmarks" and "evaluate AI... | 27 | Experimental |
| 53 | melchiorhering/GUI-OS-AI-Agent-Benchmarking | A modular framework for benchmarking multimodal AI agents in a reproducible,... | 27 | Experimental |
| 54 | Parslee-ai/statebench | Conformance test for stateful AI agents. Measures state correctness over time. | 26 | Experimental |
| 55 | sstklen/washin-api-benchmark | From Benchmarks to Architecture — We tested 30+ AI APIs, designed routing... | 26 | Experimental |
| 56 | Terminus-Lab/themis | LLM evaluation service with validated judges. Multi-dimensional scoring... | 26 | Experimental |
| 57 | AnLuo1/Assisted-DS | This is the official page of the paper "AssistedDS: Benchmarking How... | 25 | Experimental |
| 58 | Syncause/syncause-benchmark | AI-driven RCA benchmark evaluating Syncause's accuracy, interpretability,... | 24 | Experimental |
| 59 | The-Swarm-Corporation/Xray-Bench | XRayBench is a state-of-the-art evaluation platform designed specifically... | 24 | Experimental |
| 60 | xyva-yuangui/smartness-eval | 🎯 12-Dimension AI Agent Intelligence Assessment (automated 12-dimension AI agent intelligence evaluation skill)... | 24 | Experimental |
| 61 | nagu-io/agent-settlement-bench | Benchmark for evaluating safety of AI agents in irreversible financial... | 24 | Experimental |
| 62 | IlyasFardaouix/Agent-racing-league | The world's first racing league for AI agents. Think F1, but the drivers are AI. | 23 | Experimental |
| 63 | pauldebdeep9/awesome-agentic-evaluation | A curated list of benchmarks, environments, papers, and tooling for agentic... | 23 | Experimental |
| 64 | yazcaleb/can-is-not-may | Authority Models for Governable AI Agents — paper, AuthorityBench (54... | 23 | Experimental |
| 65 | DUBSOpenHub/shadow-score-spec | A framework-agnostic metric for measuring AI code generation quality.... | 23 | Experimental |
| 66 | clouatre-labs/llm-agent-experiments | Benchmarking open-weight LLM coding agents as SCOUT delegates: model... | 23 | Experimental |
| 67 | dataanswer/awesome-agent-benchmarks | A curated collection of the world's most advanced benchmark datasets for... | 23 | Experimental |
| 68 | azurefr/AI-Arena---Benchmarking-Platform-for-Autonomous-AI-Agents | Benchmark autonomous AI agents by measuring their reasoning and competitive... | 22 | Experimental |
| 69 | joshualamerton/agent-evaluation-lab | Sandbox platform for testing and evaluating autonomous agents | 22 | Experimental |
| 70 | osheryadgar/tendedloop-arena | Python SDK for TendedLoop Arena — multi-agent gamification research... | 22 | Experimental |
| 71 | Ethandata/crucible-sim | Crucible — The Economic Autonomy Standard. Stress-test AI agents under... | 22 | Experimental |
| 72 | MukundaKatta/AgentBench | Agent evaluation and benchmarking suite — accuracy, efficiency, and tool... | 22 | Experimental |
| 73 | Vinashu/razor-cascade | Framework to benchmark same-provider LLM cascading and measure API cost,... | 22 | Experimental |
| 74 | choutos/agent-eval-framework | Lightweight, practical evaluation framework for AI agents in production.... | 22 | Experimental |
| 75 | dario-github/agent-self-evolution | Automated evaluation, ablation testing, and continuous improvement framework... | 22 | Experimental |
| 76 | ristponex/awesome-minimax-m2.7 | 🧠 Awesome MiniMax M2.7 — Self-evolving coding AI. Integrations, benchmarks,... | 22 | Experimental |
| 77 | davidgracemann/statma | stat-my-agent; benchmark consistency, tool-use, failure-recovery and... | 22 | Experimental |
| 78 | evan66547/Contract-Reviewer-Agent-Eval | ⚖️ Benchmark evaluation framework for AI-powered legal contract review... | 22 | Experimental |
| 79 | dairongzhen3-creator/illusion-of-emergence | Why your multi-agent LLM deception experiment might be measuring prompt... | 22 | Experimental |
| 80 | widingmarcus-cyber/opengym | 240 challenges to test if your AI agent actually works — not just the model,... | 22 | Experimental |
| 81 | diorwave/agent-playground | A minimal sandbox to run, score, and compare AI agent outputs locally. | 22 | Experimental |
| 82 | alexmar07/agent-arena | A self-regulating arena where AI agents compete for work through sealed-bid auctions | 22 | Experimental |
| 83 | yiyangzhang-ai/open-agent-eval | Lightweight open-source toolkit for evaluating tool-calling AI agents on... | 22 | Experimental |
| 84 | AaronZhou-THU/agent-eval-workbench | A practical workbench for prompt, model, and mocked workflow evaluation with... | 22 | Experimental |
| 85 | tsanthoshreddy/agent-qa-lab | Trace-aware regression harness for tool-using Strands agents with... | 22 | Experimental |
| 86 | someonehereexists/AI-Arena---Benchmarking-Platform-for-Autonomous-AI-Agents | AI Arena is a competitive evaluation framework where multiple AI agents... | 22 | Experimental |
| 87 | BayramAnnakov/eval-coach | Agent Skill for Evaluation-Driven Development (EDD) - guide AI evaluation... | 22 | Experimental |
| 88 | justindobbs/awesome-certified-agents | A community catalog of autonomous agents and bundles certified by passing... | 22 | Experimental |
| 89 | eliumusk/agentreflect | AI agent self-reflection & self-evaluation tool. Built by an AI, for AIs. | 21 | Experimental |
| 90 | kadubon/search-stability-lab | Theory-to-experiment lab for search stability in long-running agents under... | 21 | Experimental |
| 91 | FishIntelGlobal/uncertainty-axioms | Computational validation suite for The First Principles of Uncertainty... | 21 | Experimental |
| 92 | dikatwoone/FluxCodeBench | 🔍 Evaluate LLM agents on multi-phase programming tasks with FluxCodeBench,... | 21 | Experimental |
| 93 | thisisyoussef/ghostfolio-agent-eval-dataset | Deterministic golden eval dataset for finance-domain agent testing... | 21 | Experimental |
| 94 | deathlabs/sunshower | Declarative and Distributed Benchmarking for AI Agents | 21 | Experimental |
| 95 | digital-rain-tech/ara-eval | ARA-Eval: Agentic Readiness Assessment — evaluation framework for... | 21 | Experimental |
| 96 | axxafo/awesome-agent-benchmarks | 🧠 Discover and evaluate advanced benchmark datasets for Large Language Model... | 21 | Experimental |
| 97 | mireya001/evalops-kit | CI-native evals for tool-using agents: datasets, traces, deterministic... | 21 | Experimental |
| 98 | robobobby/agenteval | Behavior test framework for AI agents. Define tests in YAML. Run against... | 21 | Experimental |
| 99 | SainathPattipati/agent-evaluation-harness | Framework to benchmark and evaluate multi-agent system performance,... | 21 | Experimental |
| 100 | HomenShum/nodebench-boilerplate | Production-ready boilerplate for AI agent projects using NodeBench MCP. 129... | 21 | Experimental |
| 101 | fraction12/open-rank | The open benchmark for AI agents — daily puzzles, public rankings | 21 | Experimental |
| 102 | 1sdeb/sidemind.ai | AI Assurance Metrics Analyzer - Evaluate LLM outputs with 15 quality... | 21 | Experimental |
| 103 | greynewell/swe-bench-pro-action | GitHub Action for SWE-bench Pro evaluation powered by mcpbr | 21 | Experimental |
| 104 | jstilb/meaningful_metrics | Open-source evaluation frameworks for human-centered metrics, AI evaluation... | 21 | Experimental |
| 105 | leaderboard-md/spec | LEADERBOARD.md — Open standard for AI agent performance benchmarking. Track... | 21 | Experimental |
| 106 | akshan-main/equitas-benchmark | Corruption-robustness benchmark for hierarchical multi-LLM committees | 21 | Experimental |
| 107 | GZQKCHQM/M_bench | Measure Apple Silicon performance for Python and NumPy workloads, providing... | 21 | Experimental |
| 108 | Jesutofunmie/Haiku-4.5-vs-Minimax-2.1 | 🧠 Benchmark Haiku 4.5 and MiniMax M2.1 on agentic tasks, revealing strengths... | 21 | Experimental |
| 109 | lechmazur/pgg_bench | Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent... | 21 | Experimental |
| 110 | jstilb/llm-eval-framework | LLM evaluation framework with custom metrics, LLM-as-judge, and... | 21 | Experimental |
| 111 | pyros-projects/agent-comparison | Qualitative benchmark suite for evaluating AI coding agents and... | 21 | Experimental |
| 112 | zahere/stochastic-circuit-breaker | Statistically optimal circuit breaker for stochastic systems. 4-state... | 21 | Experimental |
| 113 | Pashasan/llm_price_sensitivity_evaluation | Conjoint experiment measuring price sensitivity and economic preferences of... | 21 | Experimental |
| 114 | speed785/evalforge | Agent Evaluation Harness — write repeatable, measurable evals for AI agents.... | 21 | Experimental |
| 115 | NeoSkillFactory/llm-benchmark | Automatically benchmarks LLM responses across multiple models using... | 21 | Experimental |
| 116 | datalayer-challenges/dabench-leaderboard | 🤖 A2A-compatible DABench evaluation leaderboard with AgentBeats architecture. | 20 | Experimental |
| 117 | BAAI-Agents/SWITCH | SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in... | 19 | Experimental |
| 118 | ian-flores/securebench | Evaluation and benchmarking framework for R LLM agents | 19 | Experimental |
| 119 | jonradoff/hiddenbench | HiddenBench: Benchmark for evaluating collective reasoning in multi-agent LLM systems | 19 | Experimental |
| 120 | manishklach/agentic_cpu_bottleneck_bench | Vendor-neutral simulator + benchmark for agent runtime overhead: fan-out,... | 19 | Experimental |
| 121 | vvsotnikov/astro-bench | Can AI agents do real science? Benchmarking AI agents on KASCADE cosmic ray... | 19 | Experimental |
| 122 | Emersoft76/ai-agent-systems-advanced-benchmarking | Modular AI agent system with LLMs, tools, and benchmark optimization | 18 | Experimental |
| 123 | tostechbr/evoloop | Framework-agnostic eval toolkit for AI agents — capture traces, judge... | 17 | Experimental |
| 124 | josephsenior/agent-evaluation-platform | 🚀 Professional-grade AI Agent Evaluation Platform. Multi-provider... | 17 | Experimental |
| 125 | vectorize-io/hindsight-benchmarks | Hindsight Benchmarks Results | 17 | Experimental |
| 126 | messeb/py-deepeval-behave-bdd-testing-example | An example that combines Behave (BDD testing) with DeepEval (LLM evaluation)... | 15 | Experimental |
| 127 | memvid/memvidbench | Benchmark tool for evaluating Memvid on the LoCoMo (Long-term Conversational... | 14 | Experimental |
| 128 | patrikmarshall/opencode-benchmark-dashboard | Measure and compare speed and accuracy of large language models using... | 14 | Experimental |
| 129 | graciegould/agent-performance-tests | Benchmarks how codebase structure affects AI agent efficiency — tool calls,... | 14 | Experimental |
| 130 | crabsatellite/lem-experiments | Reproducible experiments for: LLM Exposure Monitoring — A Security Framework... | 14 | Experimental |
| 131 | Lap-Platform/Lap-benchmark-docs | LAP benchmark results — 500 runs, 50 specs, 5 formats. Agents run 35%... | 14 | Experimental |
| 132 | Software-Engineering-Arena/SWE-Agent-Arena | Compare agents pairwise via multi-round evaluations for SE tasks. | 14 | Experimental |
| 133 | Red1-Rahman/Prompt-Eval | Streamlit prompt evaluation tool that auto-generates test cases, runs evals,... | 14 | Experimental |
| 134 | jamjet-labs/jamjet-benchmarks | JamJet benchmarks, migration guides, and feature comparisons vs LangGraph,... | 14 | Experimental |
| 135 | Ritvik777/Galileo_Project | Galileo: Observations and Evals | 14 | Experimental |
| 136 | avdolgikh/poker-coach-eval-harness | LLM-powered evaluation harness for detecting orchestration failures in AI... | 14 | Experimental |
| 137 | memstate-ai/memstate-benchmark | Open-source benchmark for AI agent memory systems — compare Memstate, mem0,... | 14 | Experimental |
| 138 | prajaktapandit7/conversational-AI-evaluation | Structured evaluation of 30 support bot conversations measuring containment,... | 13 | Experimental |
| 139 | brianjmarvin/datasnack-ai | The DataSnack AI Agent Evaluator is a CLI tool that automates the testing of... | 13 | Experimental |
| 140 | corradocavalli/agentic_evaluation | Demonstration of testing and evaluation patterns for AI agents using Azure... | 13 | Experimental |
| 141 | lintware/AI_Agent_Frameworks_Comparison | Benchmark comparing 8 AI agent frameworks (SmolAgents, OpenAI Agents SDK,... | 13 | Experimental |
| 142 | ImSudhakar07/RivalReview-Evals | An eval platform that continuously monitors the quality of the /RivalReview... | 13 | Experimental |
| 143 | yzotop/ab-factory-demo | Deterministic multi-agent A/B test evaluation system with policy engine,... | 13 | Experimental |
| 144 | Jojodicus/ai-identity-benchmark | Does the identity in a system prompt change performance? | 13 | Experimental |
| 145 | EmZod/Earth-Magnetic-Field-Research-Minimax-w-subagents-in-pi- | Multi-agent research orchestration using MiniMax-M2.1 with thinking enabled.... | 13 | Experimental |
| 146 | codedbyelif/els-judge | Multi-LLM consensus engine for automated code review, diff analysis, and... | 13 | Experimental |
| 147 | EmZod/Haiku-4.5-vs-Minimax-2.1 | Systematic benchmark comparing Claude Haiku 4.5 vs MiniMax M2.1 on agentic... | 13 | Experimental |
| 148 | mohsinsheikhani/support-fte-evals | Eval-driven Customer Support FTE using OpenAI Agents SDK. Multi-agent... | 11 | Experimental |
| 149 | rithirs/Autoeval | Automated evaluation framework using multi-agent backend with FastAPI for... | 10 | Experimental |

Comparisons in this category