zhangxjohn/LLM-Agent-Benchmark-List
A benchmark list for evaluating large language models.
This resource helps AI researchers and developers understand and compare how well Large Language Models (LLMs) and LLM-powered agents perform on different tasks. It provides a structured list of benchmarks, including papers and project pages, so you can select appropriate evaluation methods for specific LLM applications. It is aimed at anyone building, researching, or deploying LLMs and agent systems who needs to rigorously assess their capabilities.
Use this if you are a researcher or developer trying to evaluate the effectiveness, reasoning, tool-use, or knowledge integration of Large Language Models and AI agents.
Not ideal if you are looking for ready-to-use models or end-user tools rather than resources for evaluating their underlying performance.
Stars: 160
Forks: 9
Language: —
License: Apache-2.0
Category: —
Last pushed: Feb 26, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zhangxjohn/LLM-Agent-Benchmark-List"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
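If you prefer scripting over curl, a minimal Python sketch for the same endpoint might look like the following. The URL is taken from the curl example above; the response schema is not documented on this page, so the sketch makes no assumptions about field names and simply fetches and pretty-prints whatever JSON comes back.

import json
import requests

# Same public endpoint as the curl example above; no API key needed
# for up to 100 requests/day (per the limits stated on this page).
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zhangxjohn/LLM-Agent-Benchmark-List"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # surface 4xx/5xx errors instead of parsing an error body

# The response schema is undocumented here, so just pretty-print the JSON.
print(json.dumps(resp.json(), indent=2))

From there you can inspect the printed keys and pick out the fields you need; with a free key, the documentation presumably explains how to pass it for the higher rate limit.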
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)