Domain-Specific Benchmarks for AI Agents

6 domain-specific benchmark projects for AI agents are tracked. 2 score above 50 (Established tier). The highest-rated is Tongyi-MAI/MobileWorld at 52/100 with 152 stars.

Get all 6 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=agents&subcategory=domain-specific-benchmarks&limit=20"
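The same request can be made from a script. The following is a minimal Python sketch using only the standard library; the endpoint and query parameters are copied from the curl command above, while the helper names `build_url` and `fetch_projects` are hypothetical. The shape of the JSON response is not documented here, so the sketch returns the parsed payload without assuming any field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="agents", subcategory="domain-specific-benchmarks", limit=20):
    """Build the request URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Fetch the URL and return the parsed JSON payload.

    The response schema is an unknown here, so the payload is returned
    as-is (whatever json.load produces) for the caller to inspect.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects(build_url())` performs the request. Since the response structure is not specified on this page, it is safest to print the result and inspect it before relying on particular keys.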

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.

#	Project	Description	Score	Tier	Stars	Language
1	Tongyi-MAI/MobileWorld	Benchmarking Autonomous Mobile Agents in Agent-User Interactive and...	52	Established	152	Python
2	OSU-NLP-Group/ScienceAgentBench	[ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents...	52	Established	128	Python
3	ml-dev-bench/ml-dev-bench	ML-Dev-Bench is a benchmark for evaluating AI agents against various ML...	44	Emerging	41	Python
4	michaelabrt/clarte-benchmark	Paired A/B benchmark suite for Clarté - measures how dependency-graph...	22	Experimental	-	TypeScript
5	zzhiyuann/agent-bench	Benchmarking framework for AI agents: pytest for AI agents. Define tasks in...	22	Experimental	-	TypeScript
6	MSKazemi/ExaBench-QA	ExaBench-QA is a benchmark and dataset for evaluating role-aware, LLM-based...	17	Experimental	-	Jupyter Notebook