TheDuckAI/arb

Advanced Reasoning Benchmark Dataset for LLMs

Quality score: 31 / 100 (Emerging)

This project gives AI researchers and developers a rigorous way to test how well large language models (LLMs) understand complex text and apply expert knowledge. It evaluates a given LLM on challenging questions across scientific and legal domains and outputs a performance score that reflects the model's capability at advanced reasoning tasks.

No commits in the last 6 months.

Use this if you are developing or evaluating large language models and need a robust way to benchmark deep comprehension and expert reasoning across a range of challenging subjects.

Not ideal if you are looking for a tool to train an LLM or to use an LLM for general knowledge retrieval; its primary purpose is rigorous evaluation.

Tags: LLM-evaluation · AI-benchmarking · natural-language-understanding · reasoning-assessment · AI-research
Stale (6m) · No package · No dependents
Maintenance 0 / 25
Adoption 8 / 25
Maturity 16 / 25
Community 7 / 25

Stars: 47
Forks: 3
Language: TypeScript
License: MIT
Last pushed: Nov 19, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/TheDuckAI/arb"

Open to everyone: 100 requests/day with no API key. A free key raises the limit to 1,000 requests/day.
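For programmatic use, here is a minimal TypeScript sketch of the same request. It assumes only that the endpoint above returns a JSON body over HTTPS; since the payload's field names are not documented on this page, the response is logged as-is rather than typed.

// Minimal sketch: query the public quality API for TheDuckAI/arb.
// Assumption: the endpoint returns JSON; its field names are not
// documented here, so the payload is printed for inspection.
const url =
  "https://pt-edge.onrender.com/api/v1/quality/llm-tools/TheDuckAI/arb";

async function fetchQuality(): Promise<unknown> {
  const res = await fetch(url); // no API key needed up to 100 requests/day
  if (!res.ok) {
    throw new Error(`Request failed: ${res.status} ${res.statusText}`);
  }
  return res.json();
}

fetchQuality()
  .then((data) => console.log(data))
  .catch(console.error);

This runs unchanged on any runtime with a global fetch (Node 18+, Deno, browsers).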