terryyz/llm-benchmark
A list of LLM benchmark frameworks.
This is a curated list of tools for evaluating Large Language Models (LLMs). It helps AI researchers, machine learning engineers, and data scientists choose the right benchmark for assessing an LLM's capabilities. You can compare various evaluation frameworks, understand their datasets, and select the most suitable one for your specific LLM project.
No commits in the last 6 months.
Use this if you are a researcher or engineer who needs to systematically compare the performance of different Large Language Models across various tasks and datasets.
Not ideal if you are looking for a tool to develop or fine-tune LLMs, as this focuses solely on evaluating existing models.
Stars: 73
Forks: 6
Language: —
License: Apache-2.0
Category: —
Last pushed: Feb 17, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/terryyz/llm-benchmark"
Open to everyone: 100 requests/day with no key needed, or 1,000/day with a free key.
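For scripted use, here is a minimal Python sketch of the same request, assuming the endpoint returns a JSON body (the response schema is not documented on this page):

import json
import urllib.request

# Same endpoint as the curl command above; no key is required
# for up to 100 requests/day.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/terryyz/llm-benchmark"

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)  # assumes the response body is JSON

print(json.dumps(data, indent=2))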
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems