Domain-Specific LLM Benchmark Tools
Benchmarks evaluating LLMs on specialized knowledge domains (legal, OSINT, cyber, numerical reasoning, KGs) and role-playing tasks. Does NOT include general-purpose LLM evaluation, vision-language model benchmarks, or cultural alignment tests.
There are 117 domain-specific benchmark tools tracked. One scores above 70 (Verified tier). The highest-rated is sierra-research/tau2-bench at 72/100 with 829 stars. Two of the top 10 are actively maintained.
Get all 117 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=domain-specific-benchmarks&limit=20"
```

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
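The JSON returned by the endpoint above can be post-processed to pull out the highest-rated tools. A minimal sketch: the field names (`items`, `name`, `score`, `tier`) are assumptions about the response schema, which is not documented here, and the second sample entry (`example/tool-b`, score 65) is purely illustrative data.

```python
import json

# Illustrative payload only -- the real field names returned by the
# /datasets/quality endpoint are assumptions. The score for tau2-bench
# (72) comes from the summary above; "example/tool-b" is made up.
SAMPLE = json.loads("""
{
  "items": [
    {"name": "sierra-research/tau2-bench", "score": 72, "tier": "Verified"},
    {"name": "example/tool-b", "score": 65, "tier": "Established"}
  ]
}
""")

def filter_by_min_score(payload: dict, min_score: int) -> list[str]:
    """Return names of tools scoring at or above min_score, best first."""
    items = [i for i in payload.get("items", []) if i.get("score", 0) >= min_score]
    items.sort(key=lambda i: i["score"], reverse=True)
    return [i["name"] for i in items]

print(filter_by_min_score(SAMPLE, 70))
```

With a threshold of 70 this keeps only the Verified-tier entry, matching the "1 score above 70" count in the summary.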
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | sierra-research/tau2-bench<br>τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment | 72 | Verified |
| 2 | xlang-ai/OSWorld<br>[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks... | | Established |
| 3 | bigcode-project/bigcodebench<br>[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI | | Established |
| 4 | scicode-bench/SciCode<br>A benchmark that challenges language models to code solutions for scientific problems | | Established |
| 5 | THUDM/AgentBench<br>A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) | | Established |
| 6 | swefficiency/swefficiency<br>Benchmark harness and code for "SWE-fficiency: Can Language Models Optimize... | | Established |
| 7 | logic-star-ai/swt-bench<br>[NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating... | | Established |
| 8 | microsoft/SWE-bench-Live<br>[NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live! | | Established |
| 9 | alibaba/sec-code-bench<br>SecCodeBench is a benchmark suite focusing on evaluating the security of... | | Established |
| 10 | open-compass/LawBench<br>Benchmarking Legal Knowledge of Large Language Models | | Emerging |
| 11 | langchain-ai/langchain-benchmarks<br>🦜💯 Flex those feathers! | | Emerging |
| 12 | OpenGenerativeAI/llm-colosseum<br>Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the... | | Emerging |
| 13 | camel-ai/crab<br>🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model... | | Emerging |
| 14 | web-arena-x/visualwebarena<br>VisualWebArena is a benchmark for multimodal agents. | | Emerging |
| 15 | rentruewang/bocoel<br>Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate... | | Emerging |
| 16 | HUST-AI-HYZ/MemoryAgentBench<br>Open source code for ICLR 2026 Paper: Evaluating Memory in LLM Agents via... | | Emerging |
| 17 | X-PLUG/WritingBench<br>WritingBench: A Comprehensive Benchmark for Generative Writing | | Emerging |
| 18 | zhangxjohn/LLM-Agent-Benchmark-List<br>A benchmark list for evaluation of large language models. | | Emerging |
| 19 | IBM/ACPBench<br>ACPBench: Reasoning about Action, Change, and Planning. A benchmark... | | Emerging |
| 20 | cornell-zhang/heurigym<br>Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization (ICLR'26) | | Emerging |
| 21 | OskarsEzerins/llm-benchmarks<br>Popular LLM benchmarks for Ruby code generation | | Emerging |
| 22 | OceanGPT/OceanGym<br>OceanGym: A Benchmark Environment for Underwater Embodied Agents | | Emerging |
| 23 | AKSW/LLM-KG-Bench<br>LLM-KG-Bench is a Framework and task collection for automated benchmarking... | | Emerging |
| 24 | mims-harvard/CUREBench<br>CUREBench @ NeurIPS 2025: Benchmarking AI reasoning for therapeutic... | | Emerging |
| 25 | lavantien/llm-tournament<br>Simple and blazingly fast dynamic evaluation platform for benchmarking Large... | | Emerging |
| 26 | principia-ai/PhysGym<br>A benchmark suite for evaluating LLM-based interactive scientific reasoning. | | Emerging |
| 27 | actiontech/sql-llm-benchmark<br>SCALE: SQL Capability Leaderboard for LLMs | | Emerging |
| 28 | THUNLP-MT/StableToolBench<br>A new tool learning benchmark aiming at well-balanced stability and reality,... | | Emerging |
| 29 | arthur-ai/bench<br>A tool for evaluating LLMs | | Emerging |
| 30 | Ammaar-Alam/minebench<br>Minecraft-style voxel benchmark for comparing AI models (Arena + Sandbox) | | Emerging |
| 31 | srikanth235/benchllama<br>Benchmark your local LLMs. | | Emerging |
| 32 | IlyaGusev/ping_pong_bench<br>A benchmark for role-playing language models | | Emerging |
| 33 | braingpt-lovelab/BrainBench<br>Source code for | | Emerging |
| 34 | adobe-research/NoLiMa<br>Official repository for "NoLiMa: Long-Context Evaluation Beyond Literal Matching" | | Emerging |
| 35 | ByteDance-Seed/WideSearch<br>WideSearch: Benchmarking Agentic Broad Info-Seeking | | Emerging |
| 36 | GoodAI/goodai-ltm-benchmark<br>A library for benchmarking the Long Term Memory and Continual learning... | | Emerging |
| 37 | LiqiangJing/DSBench<br>[ICLR 2025] DSBench: How Far are Data Science Agents from Becoming Data... | | Emerging |
| 38 | InternScience/SGI-Bench<br>Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows | | Emerging |
| 39 | msu-denver/bili-core<br>bili-core is an open-source framework for LLM benchmarking using LangChain,... | | Emerging |
| 40 | mazzzystar/TurtleBench<br>TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles. | | Emerging |
| 41 | gersteinlab/ML-Bench<br>ML-Bench: Evaluating Large Language Models and Agents for Machine Learning... | | Emerging |
| 42 | SAP-samples/llm-agents-eval-tutorial<br>Tutorial Materials for the paper "Evaluation & Benchmarking of LLM Agents: A... | | Emerging |
| 43 | eth-lre/mathtutorbench<br>Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors,... | | Emerging |
| 44 | stevesolun/Chameleon<br>🦎 Benchmark LLM robustness under semantic paraphrasing. Tests how models... | | Emerging |
| 45 | jpmorganchase/CyberBench<br>CyberBench: A Multi-Task Cyber LLM Benchmark | | Emerging |
| 46 | humanlaya/OneMillion-Bench<br>Official evals for $OneMillion-Bench | | Emerging |
| 47 | lechmazur/nyt-connections<br>Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended... | | Emerging |
| 48 | THUDM/AlignBench<br>Multi-dimensional Chinese alignment benchmark for large language models (ACL 2024) | | Emerging |
| 49 | THUDM/VisualAgentBench<br>Towards Large Multimodal Models as Visual Foundation Agents | | Emerging |
| 50 | Q-Future/Q-Bench<br>①[ICLR2024 Spotlight] (GPT-4V/Gemini-Pro/Qwen-VL-Plus+16 OS MLLMs) A... | | Emerging |
| 51 | rohanelukurthy/rig-rank<br>A Go CLI tool to benchmark local LLMs via Ollama, measuring Time To First... | | Emerging |
| 52 | MetriLLM/metrillm<br>Benchmark local LLM models: speed, quality, and hardware fitness scoring.... | | Emerging |
| 53 | parameterlab/c-seo-bench<br>Source code of "C-SEO Bench: Does Conversational SEO Work?" NeurIPS D&B 2025 | | Emerging |
| 54 | Laoyu84/4onebench<br>A minimalist benchmarking tool designed to test the routine-generation... | | Emerging |
| 55 | TrustAIRLab/HateBench<br>[USENIX'25] HateBench: Benchmarking Hate Speech Detectors on LLM-Generated... | | Emerging |
| 56 | Cybonto/OllaBench<br>Evaluating LLMs' Cognitive Behavioral Reasoning for Cybersecurity | | Emerging |
| 57 | terryyz/llm-benchmark<br>A list of LLM benchmark frameworks. | | Emerging |
| 58 | ImBIOS/thiqah-ops<br>AI SysAdmin Trust Benchmark - Comprehensive testing suite for evaluating LLM... | | Emerging |
| 59 | ag-sc/Robo-CSK-Benchmark<br>Benchmark for evaluating Embodied Commonsense Capabilities (e.g. of LLMs) | | Emerging |
| 60 | YerbaPage/SWE-Exp<br>SWE-Exp: Experience-Driven Software Issue Resolution | | Emerging |
| 61 | ccmdi/osintbench<br>OSINT benchmark for language models | | Emerging |
| 62 | EachSheep/ShortcutsBench<br>ShortcutsBench: A Large-Scale Real-World Benchmark for API-Based Agents | | Emerging |
| 63 | FreedomIntelligence/MTalk-Bench<br>MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via... | | Emerging |
| 64 | OpenGVLab/Multi-Modality-Arena<br>Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to... | | Emerging |
| 65 | ormeilu/RuCa<br>RuCa Benchmark (pronounced "roo-ka") - Russian Tool Calling Benchmark for LLM | | Emerging |
| 66 | research-outcome/LLM-Game-Benchmark<br>Evaluating Large Language Models with Grid-Based Game Competitions: An... | | Emerging |
| 67 | TheDuckAI/arb<br>Advanced Reasoning Benchmark Dataset for LLMs | | Emerging |
| 68 | ScholarXIV/enkokilish_bench<br>Amharic Riddle Benchmark for LLMs | | Emerging |
| 69 | zjunlp/ChineseHarm-bench<br>ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark | | Emerging |
| 70 | CAS-SIAT-XinHai/CPsyExam<br>[COLING 2025] CPsyExam: A Chinese Benchmark for Evaluating Psychology using... | | Experimental |
| 71 | zihao-ai/EARBench<br>Benchmarking Physical Risk Awareness of Foundation Model-based Embodied AI Agents | | Experimental |
| 72 | stalkermustang/llm-bulls-and-cows-benchmark<br>A mini-framework for evaluating LLM performance on the Bulls and Cows number... | | Experimental |
| 73 | nttmdlab-nlp/ToMATO<br>ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking... | | Experimental |
| 74 | dylan-slack/Tablet<br>The TABLET benchmark for evaluating instruction learning with LLMs for... | | Experimental |
| 75 | Swival/calibra<br>A benchmarking harness for coding agents. | | Experimental |
| 76 | IAAR-Shanghai/NewsBench<br>[ACL 2024 Main] NewsBench: A Systematic Evaluation Framework for Assessing... | | Experimental |
| 77 | oaimli/SciTrek<br>Benchmarking long-context reasoning on scientific articles | | Experimental |
| 78 | caixd-220529/LifelongAgentBench<br>Code repo for "LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners" | | Experimental |
| 79 | jordan-gibbs/secret-hitler-bench<br>An LLM benchmark based on the popular social deception game, Secret Hitler.... | | Experimental |
| 80 | ChutaVeias/thiqah-ops<br>🤖 Evaluate AI competence in sysadmin tasks with ThiqahOps, a benchmark suite... | | Experimental |
| 81 | Boopi7/brain-bench<br>Source code for | | Experimental |
| 82 | umayer16/VIBEBENCH<br>An automated framework for holistic evaluation of LLM-generated code using... | | Experimental |
| 83 | VisualWebBench/VisualWebBench<br>Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs... | | Experimental |
| 84 | mnbplus/llm-gateway-bench<br>CLI benchmark suite for LLM providers and OpenAI-compatible gateways.... | | Experimental |
| 85 | lechmazur/generalization<br>Thematic Generalization Benchmark: measures how effectively various LLMs can... | | Experimental |
| 86 | GAIR-NLP/benbench<br>Benchmarking Benchmark Leakage in Large Language Models | | Experimental |
| 87 | Visual-AI/GAMEBoT<br>[ACL 2025] GAMEBoT: Transparent Assessment of LLM Reasoning in Games | | Experimental |
| 88 | HSTRG1/GHOST_benchmarks<br>A collection of hardware Trojans (HTs) automatically generated by Large... | | Experimental |
| 89 | ArbitrHq/ocr-mini-bench<br>Official OCR mini-bench repository for public use. | | Experimental |
| 90 | wimi321/task-bundle<br>Turn AI coding runs into portable, replayable, benchmark-ready task bundles. | | Experimental |
| 91 | KID-22/Cocktail<br>Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated... | | Experimental |
| 92 | VTSTech/VTSTech-GPTBench<br>Benchmark Ollama Models for Instruction Following, Tool Calling and Agent Workflows | | Experimental |
| 93 | hra42/krites<br>LLM benchmark platform comparing models with real-time streaming, metrics,... | | Experimental |
| 94 | KandyBoi1/enkokilish_bench<br>🧩 Benchmark LLMs on their ability to solve Amharic riddles using Evalite for... | | Experimental |
| 95 | pvlbzn/latai<br>LatAI – A latency benchmarking tool for evaluating multiple generative AI... | | Experimental |
| 96 | contactvaibhavi/GVR-Bench<br>Pipeline to investigate structured reasoning and instruction adherence in... | | Experimental |
| 97 | 0xsomesh/rawbench<br>RawBench: Powerful, minimal framework for LLM prompt evaluation with YAML... | | Experimental |
| 98 | MarcT0K/TOSSS-LLM-Benchmark<br>TOSSS, an extensible LLM security benchmark based on the CVE database | | Experimental |
| 99 | PrimisAI/arcbench<br>A benchmark for evaluating advanced reasoning in language models and... | | Experimental |
| 100 | Mr-Dark-debug/RetardBench<br>RetardBench is an open, no-censorship benchmark that ranks large language... | | Experimental |
| 101 | NickRiccardi/two-word-test<br>Two Word Test: Combinatorial Semantic Benchmark for LLMs | | Experimental |
| 102 | Antix5/ProductBench<br>This is a benchmark to see LLMs' ability to understand complex product... | | Experimental |
| 103 | lemon07r/SanityBoard<br>Home of the SanityHarness Leaderboard website. | | Experimental |
| 104 | abronte/wordlebench<br>WordleBench is a benchmark for evaluating LLMs on their ability to solve... | | Experimental |
| 105 | kreasof-ai/infinite-benchmark-glitch<br>We Found an Infinite Benchmark Glitch: Dynamic N-Dimensional Grid Regression... | | Experimental |
| 106 | alextyhwang/Chatio-LLM-Benchmark<br>The benchmark for real-world helpfulness. Evaluating LLMs on empathy,... | | Experimental |
| 107 | gqgs/llm100kbench<br>LLM 100k portfolio management benchmark | | Experimental |
| 108 | jdleo/weirdbench<br>Open-source LLM benchmarking site for unconventional evals, with local... | | Experimental |
| 109 | GAIR-NLP/BeHonest<br>BeHonest: Benchmarking Honesty in Large Language Models | | Experimental |
| 110 | not-shivansh/AI-Bench-AI-Evaluation<br>AI benchmarking platform using Groq (LLaMA 3.1) with hybrid NLP evaluation... | | Experimental |
| 111 | Overarm-philippinecedar244/blindbench<br>Diagnose reasoning errors in large language models using blind human voting... | | Experimental |
| 112 | Rootly-AI-Labs/GMCQ-benchmark<br>Evaluation benchmark for language models to understand code to close pull requests. | | Experimental |
| 113 | xInfer123/octobench<br>Benchmark and compare LLM tool, configuration, and prompt setups using a... | | Experimental |
| 114 | JanFalkin/llmbench<br>pprof for LLM inference. Benchmark and analyze performance of... | | Experimental |
| 115 | guhcostan/gym-ai-benchmark<br>AI Benchmark for Physical Education and Gym Training Knowledge - Evaluate... | | Experimental |
| 116 | brianpeiris/llm-basic-letter-counting-benchmark<br>A basic letter-counting benchmark for LLMs | | Experimental |
| 117 | reuank/ThinkBench<br>ThinkBench is an LLM benchmarking tool focused on evaluating the... | | Experimental |