Domain-Specific LLM Benchmark Tools
Benchmarks evaluating LLMs on specialized knowledge domains (legal, OSINT, cyber, numerical reasoning, KGs) and role-playing tasks. Does NOT include general-purpose LLM evaluation, vision-language model benchmarks, or cultural alignment tests.
There are 117 domain-specific benchmark tools tracked. One scores above 70 (Verified tier). The highest-rated is sierra-research/tau2-bench at 72/100 with 829 stars. Two of the top 10 are actively maintained.
Get all 117 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=domain-specific-benchmarks&limit=20"
```

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
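The JSON returned by the endpoint above can be post-processed to pull out the highest-rated tools. A minimal sketch: the field names (`items`, `name`, `score`, `tier`) are assumptions about the response schema, which is not documented here, and the second sample entry (`example/tool-b`, score 65) is purely illustrative data.

```python
import json

# Illustrative payload only -- the real field names returned by the
# /datasets/quality endpoint are assumptions. The score for tau2-bench
# (72) comes from the summary above; "example/tool-b" is made up.
SAMPLE = json.loads("""
{
  "items": [
    {"name": "sierra-research/tau2-bench", "score": 72, "tier": "Verified"},
    {"name": "example/tool-b", "score": 65, "tier": "Established"}
  ]
}
""")

def filter_by_min_score(payload: dict, min_score: int) -> list[str]:
    """Return names of tools scoring at or above min_score, best first."""
    items = [i for i in payload.get("items", []) if i.get("score", 0) >= min_score]
    items.sort(key=lambda i: i["score"], reverse=True)
    return [i["name"] for i in items]

print(filter_by_min_score(SAMPLE, 70))
```

With a threshold of 70 this keeps only the Verified-tier entry, matching the "1 score above 70" count in the summary.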
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | sierra-research/tau2-bench<br>τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment | 72 | Verified |
| 2 | xlang-ai/OSWorld<br>[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks... | | Established |
| 3 | bigcode-project/bigcodebench<br>[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI | | Established |
| 4 | scicode-bench/SciCode<br>A benchmark that challenges language models to code solutions for scientific problems | | Established |
| 5 | THUDM/AgentBench<br>A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) | | Established |
| 6 | swefficiency/swefficiency<br>Benchmark harness and code for "SWE-fficiency: Can Language Models Optimize... | | Established |
| 7 | logic-star-ai/swt-bench<br>[NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating... | | Established |
| 8 | microsoft/SWE-bench-Live<br>[NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live! | | Established |
| 9 | alibaba/sec-code-bench<br>SecCodeBench is a benchmark suite focusing on evaluating the security of... | | Established |
| 10 | open-compass/LawBench<br>Benchmarking Legal Knowledge of Large Language Models | | Emerging |
| 11 | langchain-ai/langchain-benchmarks<br>🦜💯 Flex those feathers! | | Emerging |
| 12 | OpenGenerativeAI/llm-colosseum<br>Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the... | | Emerging |
| 13 | camel-ai/crab<br>🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model... | | Emerging |
| 14 | web-arena-x/visualwebarena<br>VisualWebArena is a benchmark for multimodal agents. | | Emerging |
| 15 | rentruewang/bocoel<br>Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate... | | Emerging |
| 16 | HUST-AI-HYZ/MemoryAgentBench<br>Open source code for ICLR 2026 Paper: Evaluating Memory in LLM Agents via... | | Emerging |
| 17 | X-PLUG/WritingBench<br>WritingBench: A Comprehensive Benchmark for Generative Writing | | Emerging |
| 18 | zhangxjohn/LLM-Agent-Benchmark-List<br>A benchmark list for evaluation of large language models. | | Emerging |
| 19 | IBM/ACPBench<br>ACPBench: Reasoning about Action, Change, and Planning. A benchmark... | | Emerging |
| 20 | cornell-zhang/heurigym<br>Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization (ICLR'26) | | Emerging |
| 21 | OskarsEzerins/llm-benchmarks<br>Popular LLM benchmarks for Ruby code generation | | Emerging |
| 22 | OceanGPT/OceanGym<br>OceanGym: A Benchmark Environment for Underwater Embodied Agents | | Emerging |
| 23 | AKSW/LLM-KG-Bench<br>LLM-KG-Bench is a Framework and task collection for automated benchmarking... | | Emerging |
| 24 | mims-harvard/CUREBench<br>CUREBench @ NeurIPS 2025: Benchmarking AI reasoning for therapeutic... | | Emerging |
| 25 | lavantien/llm-tournament<br>Simple and blazingly fast dynamic evaluation platform for benchmarking Large... | | Emerging |
| 26 | principia-ai/PhysGym<br>A benchmark suite for evaluating LLM-based interactive scientific reasoning. | | Emerging |
| 27 | actiontech/sql-llm-benchmark<br>SCALE: SQL Capability Leaderboard for LLMs | | Emerging |
| 28 | THUNLP-MT/StableToolBench<br>A new tool learning benchmark aiming at well-balanced stability and reality,... | | Emerging |
| 29 | arthur-ai/bench<br>A tool for evaluating LLMs | | Emerging |
| 30 | Ammaar-Alam/minebench<br>Minecraft-style voxel benchmark for comparing AI models (Arena + Sandbox) | | Emerging |
| 31 | srikanth235/benchllama<br>Benchmark your local LLMs. | | Emerging |
| 32 | IlyaGusev/ping_pong_bench<br>A benchmark for role-playing language models | | Emerging |
| 33 | braingpt-lovelab/BrainBench<br>Source code for | | Emerging |
| 34 | adobe-research/NoLiMa<br>Official repository for "NoLiMa: Long-Context Evaluation Beyond Literal Matching" | | Emerging |
| 35 | ByteDance-Seed/WideSearch<br>WideSearch: Benchmarking Agentic Broad Info-Seeking | | Emerging |
| 36 | GoodAI/goodai-ltm-benchmark<br>A library for benchmarking the Long Term Memory and Continual learning... | | Emerging |
| 37 | LiqiangJing/DSBench<br>[ICLR 2025] DSBench: How Far are Data Science Agents from Becoming Data... | | Emerging |
| 38 | InternScience/SGI-Bench<br>Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows | | Emerging |
| 39 | msu-denver/bili-core<br>bili-core is an open-source framework for LLM benchmarking using LangChain,... | | Emerging |
| 40 | mazzzystar/TurtleBench<br>TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles. | | Emerging |
| 41 | gersteinlab/ML-Bench<br>ML-Bench: Evaluating Large Language Models and Agents for Machine Learning... | | Emerging |
| 42 | SAP-samples/llm-agents-eval-tutorial<br>Tutorial Materials for the paper "Evaluation & Benchmarking of LLM Agents: A... | | Emerging |
| 43 | eth-lre/mathtutorbench<br>Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors,... | | Emerging |
| 44 | stevesolun/Chameleon<br>🦎 Benchmark LLM robustness under semantic paraphrasing. Tests how models... | | Emerging |
| 45 | jpmorganchase/CyberBench<br>CyberBench: A Multi-Task Cyber LLM Benchmark | | Emerging |
| 46 | humanlaya/OneMillion-Bench<br>Official evals for $OneMillion-Bench | | Emerging |
| 47 | lechmazur/nyt-connections<br>Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended... | | Emerging |
| 48 | THUDM/AlignBench<br>Multi-dimensional Chinese alignment benchmark for large language models (ACL 2024) | | Emerging |
| 49 | THUDM/VisualAgentBench<br>Towards Large Multimodal Models as Visual Foundation Agents | | Emerging |
| 50 | Q-Future/Q-Bench<br>①[ICLR2024 Spotlight] (GPT-4V/Gemini-Pro/Qwen-VL-Plus+16 OS MLLMs) A... | | Emerging |
| 51 | rohanelukurthy/rig-rank<br>A Go CLI tool to benchmark local LLMs via Ollama, measuring Time To First... | | Emerging |
| 52 | MetriLLM/metrillm<br>Benchmark local LLM models: speed, quality, and hardware fitness scoring.... | | Emerging |
| 53 | parameterlab/c-seo-bench<br>Source code of "C-SEO Bench: Does Conversational SEO Work?" NeurIPS D&B 2025 | | Emerging |
| 54 | Laoyu84/4onebench<br>A minimalist benchmarking tool designed to test the routine-generation... | | Emerging |
| 55 | TrustAIRLab/HateBench<br>[USENIX'25] HateBench: Benchmarking Hate Speech Detectors on LLM-Generated... | | Emerging |
| 56 | Cybonto/OllaBench<br>Evaluating LLMs' Cognitive Behavioral Reasoning for Cybersecurity | | Emerging |
| 57 | terryyz/llm-benchmark<br>A list of LLM benchmark frameworks. | | Emerging |
| 58 | ImBIOS/thiqah-ops<br>AI SysAdmin Trust Benchmark - Comprehensive testing suite for evaluating LLM... | | Emerging |
| 59 | ag-sc/Robo-CSK-Benchmark<br>Benchmark for evaluating Embodied Commonsense Capabilities (e.g. of LLMs) | | Emerging |
| 60 | YerbaPage/SWE-Exp<br>SWE-Exp: Experience-Driven Software Issue Resolution | | Emerging |
| 61 | ccmdi/osintbench<br>OSINT benchmark for language models | | Emerging |
| 62 | EachSheep/ShortcutsBench<br>ShortcutsBench: A Large-Scale Real-World Benchmark for API-Based Agents | | Emerging |
| 63 | FreedomIntelligence/MTalk-Bench<br>MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via... | | Emerging |
| 64 | OpenGVLab/Multi-Modality-Arena<br>Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to... | | Emerging |
| 65 | ormeilu/RuCa<br>RuCa Benchmark (pronounced "roo-ka") - Russian Tool Calling Benchmark for LLM | | Emerging |
| 66 | research-outcome/LLM-Game-Benchmark<br>Evaluating Large Language Models with Grid-Based Game Competitions: An... | | Emerging |
| 67 | TheDuckAI/arb<br>Advanced Reasoning Benchmark Dataset for LLMs | | Emerging |
| 68 | ScholarXIV/enkokilish_bench<br>Amharic Riddle Benchmark for LLMs | | Emerging |
| 69 | zjunlp/ChineseHarm-bench<br>ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark | | Emerging |
| 70 | CAS-SIAT-XinHai/CPsyExam<br>[COLING 2025] CPsyExam: A Chinese Benchmark for Evaluating Psychology using... | | Experimental |
| 71 | zihao-ai/EARBench<br>Benchmarking Physical Risk Awareness of Foundation Model-based Embodied AI Agents | | Experimental |
| 72 | stalkermustang/llm-bulls-and-cows-benchmark<br>A mini-framework for evaluating LLM performance on the Bulls and Cows number... | | Experimental |
| 73 | nttmdlab-nlp/ToMATO<br>ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking... | | Experimental |
| 74 | dylan-slack/Tablet<br>The TABLET benchmark for evaluating instruction learning with LLMs for... | | Experimental |
| 75 | Swival/calibra<br>A benchmarking harness for coding agents. | | Experimental |
| 76 | IAAR-Shanghai/NewsBench<br>[ACL 2024 Main] NewsBench: A Systematic Evaluation Framework for Assessing... | | Experimental |
| 77 | oaimli/SciTrek<br>Benchmarking long-context reasoning on scientific articles | | Experimental |
| 78 | caixd-220529/LifelongAgentBench<br>Code repo for "LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners" | | Experimental |
| 79 | jordan-gibbs/secret-hitler-bench<br>An LLM benchmark based on the popular social deception game, Secret Hitler.... | | Experimental |
| 80 | ChutaVeias/thiqah-ops<br>🤖 Evaluate AI competence in sysadmin tasks with ThiqahOps, a benchmark suite... | | Experimental |
| 81 | Boopi7/brain-bench<br>Source code for | | Experimental |
| 82 | umayer16/VIBEBENCH<br>An automated framework for holistic evaluation of LLM-generated code using... | | Experimental |
| 83 | VisualWebBench/VisualWebBench<br>Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs... | | Experimental |
| 84 | mnbplus/llm-gateway-bench<br>CLI benchmark suite for LLM providers and OpenAI-compatible gateways.... | | Experimental |
| 85 | lechmazur/generalization<br>Thematic Generalization Benchmark: measures how effectively various LLMs can... | | Experimental |
| 86 | GAIR-NLP/benbench<br>Benchmarking Benchmark Leakage in Large Language Models | | Experimental |
| 87 | Visual-AI/GAMEBoT<br>[ACL 2025] GAMEBoT: Transparent Assessment of LLM Reasoning in Games | | Experimental |
| 88 | HSTRG1/GHOST_benchmarks<br>A collection of hardware Trojans (HTs) automatically generated by Large... | | Experimental |
| 89 | ArbitrHq/ocr-mini-bench<br>Official OCR mini-bench repository for public use. | | Experimental |
| 90 | wimi321/task-bundle<br>Turn AI coding runs into portable, replayable, benchmark-ready task bundles. | | Experimental |
| 91 | KID-22/Cocktail<br>Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated... | | Experimental |
| 92 | VTSTech/VTSTech-GPTBench<br>Benchmark Ollama Models for Instruction Following, Tool Calling and Agent Workflows | | Experimental |
| 93 | hra42/krites<br>LLM benchmark platform comparing models with real-time streaming, metrics,... | | Experimental |
| 94 | KandyBoi1/enkokilish_bench<br>🧩 Benchmark LLMs on their ability to solve Amharic riddles using Evalite for... | | Experimental |
| 95 | pvlbzn/latai<br>LatAI – A latency benchmarking tool for evaluating multiple generative AI... | | Experimental |
| 96 | contactvaibhavi/GVR-Bench<br>Pipeline to investigate structured reasoning and instruction adherence in... | | Experimental |
| 97 | 0xsomesh/rawbench<br>RawBench: Powerful, minimal framework for LLM prompt evaluation with YAML... | | Experimental |
| 98 | MarcT0K/TOSSS-LLM-Benchmark<br>TOSSS, an extensible LLM security benchmark based on the CVE database | | Experimental |
| 99 | PrimisAI/arcbench<br>A benchmark for evaluating advanced reasoning in language models and... | | Experimental |
| 100 | Mr-Dark-debug/RetardBench<br>RetardBench is an open, no-censorship benchmark that ranks large language... | | Experimental |
| 101 | NickRiccardi/two-word-test<br>Two Word Test: Combinatorial Semantic Benchmark for LLMs | | Experimental |
| 102 | Antix5/ProductBench<br>This is a benchmark to see LLMs' ability to understand complex product... | | Experimental |
| 103 | lemon07r/SanityBoard<br>Home of the SanityHarness Leaderboard website. | | Experimental |
| 104 | abronte/wordlebench<br>WordleBench is a benchmark for evaluating LLMs on their ability to solve... | | Experimental |
| 105 | kreasof-ai/infinite-benchmark-glitch<br>We Found an Infinite Benchmark Glitch: Dynamic N-Dimensional Grid Regression... | | Experimental |
| 106 | alextyhwang/Chatio-LLM-Benchmark<br>The benchmark for real-world helpfulness. Evaluating LLMs on empathy,... | | Experimental |
| 107 | gqgs/llm100kbench<br>LLM 100k portfolio management benchmark | | Experimental |
| 108 | jdleo/weirdbench<br>Open-source LLM benchmarking site for unconventional evals, with local... | | Experimental |
| 109 | GAIR-NLP/BeHonest<br>BeHonest: Benchmarking Honesty in Large Language Models | | Experimental |
| 110 | not-shivansh/AI-Bench-AI-Evaluation<br>AI benchmarking platform using Groq (LLaMA 3.1) with hybrid NLP evaluation... | | Experimental |
| 111 | Overarm-philippinecedar244/blindbench<br>Diagnose reasoning errors in large language models using blind human voting... | | Experimental |
| 112 | Rootly-AI-Labs/GMCQ-benchmark<br>Evaluation benchmark for language models to understand code to close pull requests. | | Experimental |
| 113 | xInfer123/octobench<br>Benchmark and compare LLM tool, configuration, and prompt setups using a... | | Experimental |
| 114 | JanFalkin/llmbench<br>pprof for LLM inference. Benchmark and analyze performance of... | | Experimental |
| 115 | guhcostan/gym-ai-benchmark<br>AI Benchmark for Physical Education and Gym Training Knowledge - Evaluate... | | Experimental |
| 116 | brianpeiris/llm-basic-letter-counting-benchmark<br>A basic letter-counting benchmark for LLMs | | Experimental |
| 117 | reuank/ThinkBench<br>ThinkBench is an LLM benchmarking tool focused on evaluating the... | | Experimental |