zhangxjohn/LLM-Agent-Benchmark-List
A benchmark list for evaluating large language models.
This resource helps AI researchers and developers understand and compare how well Large Language Models (LLMs) and LLM-powered agents perform on different tasks. It provides a structured list of benchmarks, including papers and project pages, so you can select appropriate evaluation methods for specific LLM applications. It is aimed at anyone building, researching, or deploying LLMs and agent systems who needs to rigorously assess their capabilities.
Use this if you are a researcher or developer trying to evaluate the effectiveness, reasoning, tool-use, or knowledge integration of Large Language Models and AI agents.
Not ideal if you are looking for ready-to-use models or end-user tools rather than resources for evaluating their underlying performance.
Stars: 160
Forks: 9
Language: —
License: Apache-2.0
Category: —
Last pushed: Feb 26, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zhangxjohn/LLM-Agent-Benchmark-List"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
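If you prefer scripting over curl, a minimal Python sketch for the same endpoint might look like the following. The URL is taken from the curl example above; the response schema is not documented on this page, so the sketch makes no assumptions about field names and simply fetches and pretty-prints whatever JSON comes back.

import json
import requests

# Same public endpoint as the curl example above; no API key needed
# for up to 100 requests/day (per the limits stated on this page).
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zhangxjohn/LLM-Agent-Benchmark-List"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # surface 4xx/5xx errors instead of parsing an error body

# The response schema is undocumented here, so just pretty-print the JSON.
print(json.dumps(resp.json(), indent=2))

From there you can inspect the printed keys and pick out the fields you need; with a free key, the documentation presumably explains how to pass it for the higher rate limit.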
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)