TheDuckAI/arb
Advanced Reasoning Benchmark Dataset for LLMs
This project lets AI researchers and developers rigorously test how well large language models (LLMs) understand complex text and apply expert knowledge. Given an LLM, it evaluates the model's ability to answer challenging questions across scientific and legal domains and reports a performance score indicating how capable the model is at advanced reasoning tasks.
No commits in the last 6 months.
Use this if you are developing or evaluating large language models and need a robust way to benchmark their deep comprehension and expert reasoning capabilities across various challenging subjects.
Not ideal if you are looking for a tool to train an LLM or to use one for general knowledge retrieval; its primary purpose is rigorous evaluation.
Stars: 47
Forks: 3
Language: TypeScript
License: MIT
Category:
Last pushed: Nov 19, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/TheDuckAI/arb"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.
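A minimal sketch of calling the same endpoint from TypeScript, using the global fetch available in Node 18+ and modern browsers. The RepoQuality interface and its fields are assumptions for illustration; the actual response schema is not documented on this page.

// Sketch: query the pt-edge quality API for a repo.
// RepoQuality is an assumed shape, not the documented schema.
interface RepoQuality {
  stars?: number;
  forks?: number;
  language?: string;
  license?: string;
  lastPushed?: string;
}

async function fetchRepoQuality(repo: string): Promise<RepoQuality> {
  const url = `https://pt-edge.onrender.com/api/v1/quality/llm-tools/${repo}`;
  const res = await fetch(url); // no API key needed up to 100 requests/day
  if (!res.ok) {
    throw new Error(`Request failed: ${res.status} ${res.statusText}`);
  }
  return (await res.json()) as RepoQuality;
}

// Usage: fetchRepoQuality("TheDuckAI/arb").then(console.log);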
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems