LLM Tools: Domain-Specific Benchmarks

Benchmarks evaluating LLMs on specialized knowledge domains (legal, OSINT, cyber, numerical reasoning, knowledge graphs) and role-playing tasks. Does NOT include general-purpose LLM evaluation, vision-language model benchmarks, or cultural alignment tests.

There are 117 domain-specific benchmark tools tracked. One scores above 70 (Verified tier). The highest-rated is sierra-research/tau2-bench at 72/100 with 829 stars. Two of the top 10 are actively maintained.

Get the projects as JSON (the curl below requests the top 20; raise the limit parameter to fetch all 117):

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=domain-specific-benchmarks&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
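
For scripted access, a minimal Python sketch along these lines should work. It assumes only the endpoint and query parameters shown in the curl above; the response field names (items, name, score, tier) are guesses at the payload shape, not a documented schema, so rename them to match what the API actually returns.

```python
import json
import urllib.parse
import urllib.request

# Endpoint and query parameters taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "llm-tools",
    "subcategory": "domain-specific-benchmarks",
    "limit": 117,  # raised from 20 so one request covers every tracked project
}

url = f"{BASE_URL}?{urllib.parse.urlencode(params)}"
with urllib.request.urlopen(url, timeout=30) as resp:
    data = json.load(resp)

# ASSUMPTION: the payload is either a list of records or a dict with an
# "items" list, and each record carries "name", "score", and "tier" keys.
records = data["items"] if isinstance(data, dict) else data
for rec in records:
    print(f'{rec.get("score", "?"):>3}  {rec.get("tier", "?"):<12}  {rec.get("name", "?")}')
```

Standard library only on purpose: a one-shot fetch like this needs no dependencies, and a single request sits comfortably inside the anonymous 100-requests/day limit.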

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | sierra-research/tau2-bench | τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment | 72 | Verified |
| 2 | xlang-ai/OSWorld | [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks... | 66 | Established |
| 3 | bigcode-project/bigcodebench | [ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI | 61 | Established |
| 4 | scicode-bench/SciCode | A benchmark that challenges language models to code solutions for scientific problems | 55 | Established |
| 5 | THUDM/AgentBench | A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) | 55 | Established |
| 6 | swefficiency/swefficiency | Benchmark harness and code for "SWE-fficiency: Can Language Models Optimize... | 54 | Established |
| 7 | logic-star-ai/swt-bench | [NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating... | 54 | Established |
| 8 | microsoft/SWE-bench-Live | [NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live! | 53 | Established |
| 9 | alibaba/sec-code-bench | SecCodeBench is a benchmark suite focusing on evaluating the security of... | 52 | Established |
| 10 | open-compass/LawBench | Benchmarking Legal Knowledge of Large Language Models | 48 | Emerging |
| 11 | langchain-ai/langchain-benchmarks | 🦜💯 Flex those feathers! | 48 | Emerging |
| 12 | OpenGenerativeAI/llm-colosseum | Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the... | 47 | Emerging |
| 13 | camel-ai/crab | 🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model... | 47 | Emerging |
| 14 | web-arena-x/visualwebarena | VisualWebArena is a benchmark for multimodal agents. | 47 | Emerging |
| 15 | rentruewang/bocoel | Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate... | 47 | Emerging |
| 16 | HUST-AI-HYZ/MemoryAgentBench | Open source code for ICLR 2026 Paper: Evaluating Memory in LLM Agents via... | 46 | Emerging |
| 17 | X-PLUG/WritingBench | WritingBench: A Comprehensive Benchmark for Generative Writing | 46 | Emerging |
| 18 | zhangxjohn/LLM-Agent-Benchmark-List | A benchmark list for evaluation of large language models. | 46 | Emerging |
| 19 | IBM/ACPBench | ACPBench: Reasoning about Action, Change, and Planning. A benchmark... | 46 | Emerging |
| 20 | cornell-zhang/heurigym | Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization (ICLR'26) | 45 | Emerging |
| 21 | OskarsEzerins/llm-benchmarks | Popular LLM benchmarks for Ruby code generation | 45 | Emerging |
| 22 | OceanGPT/OceanGym | OceanGym: A Benchmark Environment for Underwater Embodied Agents | 45 | Emerging |
| 23 | AKSW/LLM-KG-Bench | LLM-KG-Bench is a framework and task collection for automated benchmarking... | 43 | Emerging |
| 24 | mims-harvard/CUREBench | CUREBench @ NeurIPS 2025: Benchmarking AI reasoning for therapeutic... | 43 | Emerging |
| 25 | lavantien/llm-tournament | Simple and blazingly fast dynamic evaluation platform for benchmarking Large... | 43 | Emerging |
| 26 | principia-ai/PhysGym | A benchmark suite for evaluating LLM-based interactive scientific reasoning. | 43 | Emerging |
| 27 | actiontech/sql-llm-benchmark | SCALE: SQL Capability Leaderboard for LLMs | 42 | Emerging |
| 28 | THUNLP-MT/StableToolBench | A new tool learning benchmark aiming at well-balanced stability and reality,... | 42 | Emerging |
| 29 | arthur-ai/bench | A tool for evaluating LLMs | 42 | Emerging |
| 30 | Ammaar-Alam/minebench | Minecraft-style voxel benchmark for comparing AI models (Arena + Sandbox) | 42 | Emerging |
| 31 | srikanth235/benchllama | Benchmark your local LLMs. | 40 | Emerging |
| 32 | IlyaGusev/ping_pong_bench | A benchmark for role-playing language models | 40 | Emerging |
| 33 | braingpt-lovelab/BrainBench | Source code for | 40 | Emerging |
| 34 | adobe-research/NoLiMa | Official repository for "NoLiMa: Long-Context Evaluation Beyond Literal Matching" | 40 | Emerging |
| 35 | ByteDance-Seed/WideSearch | WideSearch: Benchmarking Agentic Broad Info-Seeking | 40 | Emerging |
| 36 | GoodAI/goodai-ltm-benchmark | A library for benchmarking the Long Term Memory and Continual learning... | 40 | Emerging |
| 37 | LiqiangJing/DSBench | [ICLR 2025] DSBench: How Far are Data Science Agents from Becoming Data... | 39 | Emerging |
| 38 | InternScience/SGI-Bench | Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows | 39 | Emerging |
| 39 | msu-denver/bili-core | bili-core is an open-source framework for LLM benchmarking using LangChain,... | 39 | Emerging |
| 40 | mazzzystar/TurtleBench | TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles. | 39 | Emerging |
| 41 | gersteinlab/ML-Bench | ML-Bench: Evaluating Large Language Models and Agents for Machine Learning... | 38 | Emerging |
| 42 | SAP-samples/llm-agents-eval-tutorial | Tutorial Materials for the paper "Evaluation & Benchmarking of LLM Agents: A... | 38 | Emerging |
| 43 | eth-lre/mathtutorbench | Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors,... | 38 | Emerging |
| 44 | stevesolun/Chameleon | 🦎 Benchmark LLM robustness under semantic paraphrasing. Tests how models... | 37 | Emerging |
| 45 | jpmorganchase/CyberBench | CyberBench: A Multi-Task Cyber LLM Benchmark | 37 | Emerging |
| 46 | humanlaya/OneMillion-Bench | Official evals for $OneMillion-Bench | 37 | Emerging |
| 47 | lechmazur/nyt-connections | Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended... | 37 | Emerging |
| 48 | THUDM/AlignBench | A multi-dimensional Chinese alignment benchmark for large language models (ACL 2024) | 37 | Emerging |
| 49 | THUDM/VisualAgentBench | Towards Large Multimodal Models as Visual Foundation Agents | 37 | Emerging |
| 50 | Q-Future/Q-Bench | ①[ICLR2024 Spotlight] (GPT-4V/Gemini-Pro/Qwen-VL-Plus+16 OS MLLMs) A... | 36 | Emerging |
| 51 | rohanelukurthy/rig-rank | A Go CLI tool to benchmark local LLMs via Ollama, measuring Time To First... | 36 | Emerging |
| 52 | MetriLLM/metrillm | Benchmark local LLM models: speed, quality, and hardware fitness scoring.... | 36 | Emerging |
| 53 | parameterlab/c-seo-bench | Source code of "C-SEO Bench: Does Conversational SEO Work?" NeurIPS D&B 2025 | 36 | Emerging |
| 54 | Laoyu84/4onebench | A minimalist benchmarking tool designed to test the routine-generation... | 35 | Emerging |
| 55 | TrustAIRLab/HateBench | [USENIX'25] HateBench: Benchmarking Hate Speech Detectors on LLM-Generated... | 35 | Emerging |
| 56 | Cybonto/OllaBench | Evaluating LLMs' Cognitive Behavioral Reasoning for Cybersecurity | 35 | Emerging |
| 57 | terryyz/llm-benchmark | A list of LLM benchmark frameworks. | 35 | Emerging |
| 58 | ImBIOS/thiqah-ops | AI SysAdmin Trust Benchmark - Comprehensive testing suite for evaluating LLM... | 35 | Emerging |
| 59 | ag-sc/Robo-CSK-Benchmark | Benchmark for evaluating Embodied Commonsense Capabilities (e.g. of LLMs) | 34 | Emerging |
| 60 | YerbaPage/SWE-Exp | SWE-Exp: Experience-Driven Software Issue Resolution | 34 | Emerging |
| 61 | ccmdi/osintbench | OSINT benchmark for language models | 34 | Emerging |
| 62 | EachSheep/ShortcutsBench | ShortcutsBench: A Large-Scale Real-World Benchmark for API-Based Agents | 34 | Emerging |
| 63 | FreedomIntelligence/MTalk-Bench | MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via... | 33 | Emerging |
| 64 | OpenGVLab/Multi-Modality-Arena | Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to... | 33 | Emerging |
| 65 | ormeilu/RuCa | RuCa Benchmark (pronounced "roo-ka") - Russian Tool Calling Benchmark for LLMs | 32 | Emerging |
| 66 | research-outcome/LLM-Game-Benchmark | Evaluating Large Language Models with Grid-Based Game Competitions: An... | 32 | Emerging |
| 67 | TheDuckAI/arb | Advanced Reasoning Benchmark Dataset for LLMs | 31 | Emerging |
| 68 | ScholarXIV/enkokilish_bench | Amharic Riddle Benchmark for LLMs | 30 | Emerging |
| 69 | zjunlp/ChineseHarm-bench | ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark | 30 | Emerging |
| 70 | CAS-SIAT-XinHai/CPsyExam | [COLING 2025] CPsyExam: A Chinese Benchmark for Evaluating Psychology using... | 29 | Experimental |
| 71 | zihao-ai/EARBench | Benchmarking Physical Risk Awareness of Foundation Model-based Embodied AI Agents | 29 | Experimental |
| 72 | stalkermustang/llm-bulls-and-cows-benchmark | A mini-framework for evaluating LLM performance on the Bulls and Cows number... | 28 | Experimental |
| 73 | nttmdlab-nlp/ToMATO | ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking... | 28 | Experimental |
| 74 | dylan-slack/Tablet | The TABLET benchmark for evaluating instruction learning with LLMs for... | 28 | Experimental |
| 75 | Swival/calibra | A benchmarking harness for coding agents. | 27 | Experimental |
| 76 | IAAR-Shanghai/NewsBench | [ACL 2024 Main] NewsBench: A Systematic Evaluation Framework for Assessing... | 26 | Experimental |
| 77 | oaimli/SciTrek | Benchmarking long-context reasoning on scientific articles | 26 | Experimental |
| 78 | caixd-220529/LifelongAgentBench | Code repo for "LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners" | 26 | Experimental |
| 79 | jordan-gibbs/secret-hitler-bench | An LLM benchmark based on the popular social deception game, Secret Hitler.... | 26 | Experimental |
| 80 | ChutaVeias/thiqah-ops | 🤖 Evaluate AI competence in sysadmin tasks with ThiqahOps, a benchmark suite... | 25 | Experimental |
| 81 | Boopi7/brain-bench | Source code for | 25 | Experimental |
| 82 | umayer16/VIBEBENCH | An automated framework for holistic evaluation of LLM-generated code using... | 25 | Experimental |
| 83 | VisualWebBench/VisualWebBench | Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs... | 25 | Experimental |
| 84 | mnbplus/llm-gateway-bench | CLI benchmark suite for LLM providers and OpenAI-compatible gateways.... | 25 | Experimental |
| 85 | lechmazur/generalization | Thematic Generalization Benchmark: measures how effectively various LLMs can... | 23 | Experimental |
| 86 | GAIR-NLP/benbench | Benchmarking Benchmark Leakage in Large Language Models | 23 | Experimental |
| 87 | Visual-AI/GAMEBoT | [ACL 2025] GAMEBoT: Transparent Assessment of LLM Reasoning in Games | 23 | Experimental |
| 88 | HSTRG1/GHOST_benchmarks | A collection of hardware Trojans (HTs) automatically generated by Large... | 22 | Experimental |
| 89 | ArbitrHq/ocr-mini-bench | Official OCR mini-bench repository for public use. | 22 | Experimental |
| 90 | wimi321/task-bundle | Turn AI coding runs into portable, replayable, benchmark-ready task bundles. | 22 | Experimental |
| 91 | KID-22/Cocktail | Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated... | 22 | Experimental |
| 92 | VTSTech/VTSTech-GPTBench | Benchmark Ollama Models for Instruction Following, Tool Calling and Agent Workflows | 22 | Experimental |
| 93 | hra42/krites | LLM benchmark platform comparing models with real-time streaming, metrics,... | 22 | Experimental |
| 94 | KandyBoi1/enkokilish_bench | 🧩 Benchmark LLMs on their ability to solve Amharic riddles using Evalite for... | 22 | Experimental |
| 95 | pvlbzn/latai | LatAI – A latency benchmarking tool for evaluating multiple generative AI... | 21 | Experimental |
| 96 | contactvaibhavi/GVR-Bench | Pipeline to investigate structured reasoning and instruction adherence in... | 21 | Experimental |
| 97 | 0xsomesh/rawbench | RawBench: Powerful, minimal framework for LLM prompt evaluation with YAML... | 21 | Experimental |
| 98 | MarcT0K/TOSSS-LLM-Benchmark | TOSSS, an extensible LLM security benchmark based on the CVE database | 21 | Experimental |
| 99 | PrimisAI/arcbench | A benchmark for evaluating advanced reasoning in language models and... | 21 | Experimental |
| 100 | Mr-Dark-debug/RetardBench | RetardBench is an open, no-censorship benchmark that ranks large language... | 21 | Experimental |
| 101 | NickRiccardi/two-word-test | Two Word Test: Combinatorial Semantic Benchmark for LLMs | 20 | Experimental |
| 102 | Antix5/ProductBench | This is a benchmark to see LLMs' ability to understand complex product... | 19 | Experimental |
| 103 | lemon07r/SanityBoard | Home of the SanityHarness Leaderboard website. | 18 | Experimental |
| 104 | abronte/wordlebench | WordleBench is a benchmark for evaluating LLMs on their ability to solve... | 17 | Experimental |
| 105 | kreasof-ai/infinite-benchmark-glitch | We Found an Infinite Benchmark Glitch: Dynamic N-Dimensional Grid Regression... | 17 | Experimental |
| 106 | alextyhwang/Chatio-LLM-Benchmark | The benchmark for real-world helpfulness. Evaluating LLMs on empathy,... | 17 | Experimental |
| 107 | gqgs/llm100kbench | LLM 100k portfolio management benchmark | 16 | Experimental |
| 108 | jdleo/weirdbench | Open-source LLM benchmarking site for unconventional evals, with local... | 15 | Experimental |
| 109 | GAIR-NLP/BeHonest | BeHonest: Benchmarking Honesty in Large Language Models | 15 | Experimental |
| 110 | not-shivansh/AI-Bench-AI-Evaluation | AI benchmarking platform using Groq (LLaMA 3.1) with hybrid NLP evaluation... | 14 | Experimental |
| 111 | Overarm-philippinecedar244/blindbench | Diagnose reasoning errors in large language models using blind human voting... | 14 | Experimental |
| 112 | Rootly-AI-Labs/GMCQ-benchmark | Evaluation benchmark for language models to understand code to close pull requests. | 14 | Experimental |
| 113 | xInfer123/octobench | Benchmark and compare LLM tool, configuration, and prompt setups using a... | 14 | Experimental |
| 114 | JanFalkin/llmbench | pprof for LLM inference. Benchmark and analyze performance of... | 13 | Experimental |
| 115 | guhcostan/gym-ai-benchmark | AI Benchmark for Physical Education and Gym Training Knowledge - Evaluate... | 13 | Experimental |
| 116 | brianpeiris/llm-basic-letter-counting-benchmark | A basic letter-counting benchmark for LLMs | 12 | Experimental |
| 117 | reuank/ThinkBench | ThinkBench is an LLM benchmarking tool focused on evaluating the... | 11 | Experimental |