zhangxjohn/LLM-Agent-Benchmark-List

A benchmark list for evaluating large language models.

Score: 46 / 100 (Emerging)

This resource helps AI researchers and developers understand and compare how well Large Language Models (LLMs) and LLM-powered agents perform on different tasks. It provides a structured list of benchmarks, including papers and project pages, allowing you to select appropriate evaluation methods for specific LLM applications. This is for anyone building, researching, or deploying LLMs and agent systems who needs to rigorously assess their capabilities.


Use this if you are a researcher or developer trying to evaluate the effectiveness, reasoning, tool-use, or knowledge integration of Large Language Models and AI agents.

Not ideal if you are looking for ready-to-use LLM models or tools for end-user applications rather than resources for evaluating their underlying performance.

Tags: AI-research, LLM-evaluation, agent-development, model-benchmarking, natural-language-processing
No package, no dependents
Maintenance: 10 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 10 / 25

How are scores calculated? Each of the four categories above is scored out of 25, and they sum to the overall score: 10 + 10 + 16 + 10 = 46 / 100.

Stars: 160
Forks: 9
Language:
License: Apache-2.0
Last pushed: Feb 26, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zhangxjohn/LLM-Agent-Benchmark-List"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
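If you want to consume the endpoint programmatically rather than via curl, here is a minimal Python sketch using only the standard library. It assumes the endpoint returns JSON; the payload's field names are not documented here, so the sketch pretty-prints the whole response instead of guessing a schema.

import json
import urllib.request

# Same public endpoint as the curl example above (100 requests/day without a key).
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zhangxjohn/LLM-Agent-Benchmark-List"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)  # assumes a JSON response body

# Field names are not documented here, so print the full payload
# and inspect it to see the actual schema before relying on keys.
print(json.dumps(data, indent=2))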