Domain-Specific Benchmarks: Transformer Models
47 domain-specific benchmark projects are tracked. 3 score above 50 (the Established tier). The highest-rated is stanfordnlp/axbench at 54/100, with 175 stars.
Fetch the projects as JSON (note that `limit=20` caps this request at the first 20 of the 47):

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=domain-specific-benchmarks&limit=20"
```
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
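The JSON response can be filtered client-side. The sketch below assumes a response shape of `{"data": [{"name", "score", "tier"}, ...]}` inferred from the table columns; the actual field names in the API payload may differ, so it runs against an inline sample rather than a live request:

```python
import json

# Hypothetical payload shape inferred from the table columns (Model / Score / Tier).
# Only stanfordnlp/axbench's score (54) is known from the summary above; the
# other score is left null rather than invented.
sample_response = json.loads("""
{
  "data": [
    {"name": "stanfordnlp/axbench", "score": 54, "tier": "Established"},
    {"name": "qcri/LLMeBench", "score": null, "tier": "Emerging"}
  ]
}
""")

def by_tier(payload, tier):
    """Return the names of projects in the given quality tier."""
    return [p["name"] for p in payload["data"] if p["tier"] == tier]

print(by_tier(sample_response, "Established"))  # ['stanfordnlp/axbench']
```

Replace `sample_response` with `json.load()` over the live endpoint once you have confirmed the real schema.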
| # | Model | Description | Score | Tier |
|---|---|---|---|---|
| 1 | stanfordnlp/axbench | Stanford NLP Python library for benchmarking the utility of LLM... | 54 | Established |
| 2 | aidatatools/ollama-benchmark | LLM Benchmark for Throughput via Ollama (Local LLMs) | | Established |
| 3 | LarHope/ollama-benchmark | Ollama based Benchmark with detail I/O token per second. Python with... | | Established |
| 4 | qcri/LLMeBench | Benchmarking Large Language Models | | Emerging |
| 5 | THUDM/LongBench | LongBench v2 and LongBench (ACL 25'&24') | | Emerging |
| 6 | microsoft/LLF-Bench | A benchmark for evaluating learning agents based on just language feedback | | Emerging |
| 7 | YJiangcm/FollowBench | [ACL 2024] FollowBench: A Multi-level Fine-grained Constraints Following... | | Emerging |
| 8 | The-FinAI/CALM | An LLM training and evaluation benchmark for credit scoring | | Emerging |
| 9 | RedHatResearch/conext24-NetConfEval | Benchmark for evaluating LLMs in network configuration problems. | | Emerging |
| 10 | OpenBMB/InfiniteBench | Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K... | | Emerging |
| 11 | cloudmercato/ollama-benchmark | Handy tool to measure the performance and efficiency of LLM workloads. | | Emerging |
| 12 | rohit901/VANE-Bench | [NAACL'25] Contains code and documentation for our VANE-Bench paper. | | Emerging |
| 13 | AnkitNayak-eth/llmBench | llmBench is a high-depth benchmarking tool designed to measure the raw... | | Emerging |
| 14 | epfml/llm-optimizer-benchmark | Benchmarking Optimizers for LLM Pretraining | | Emerging |
| 15 | AIFEG/BenchLMM | [ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large... | | Emerging |
| 16 | zhchen18/ToMBench | ToMBench: Benchmarking Theory of Mind in Large Language Models, ACL 2024. | | Emerging |
| 17 | ApplyU-ai/ColorBlindnessEval | ColorBlindnessEval: Can Vision Language Models Pass Color Blindness Tests? | | Emerging |
| 18 | SORRY-Bench/sorry-bench | Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large... | | Emerging |
| 19 | SapienzaNLP/ita-bench | A collection of Italian benchmarks for LLM evaluation | | Emerging |
| 20 | deep-symbolic-mathematics/llm-srbench | [ICML2025 Oral] LLM-SRBench: A New Benchmark for Scientific Equation... | | Emerging |
| 21 | HiThink-Research/BizFinBench | A Business-Driven Real-World Financial Benchmark for Evaluating LLMs | | Emerging |
| 22 | RaptorMai/MLLM-CompBench | [NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs... | | Emerging |
| 23 | EternityYW/RUPBench | RUPBench: Benchmarking Reasoning Under Perturbations for Robustness... | | Emerging |
| 24 | EternityYW/TRAM-Benchmark | TRAM: Benchmarking Temporal Reasoning for Large Language Models (Findings of... | | Emerging |
| 25 | zchuz/TimeBench | The repository for the ACL 2024 paper "TimeBench: A Comprehensive Evaluation of... | | Experimental |
| 26 | MileBench/MileBench | This repo contains evaluation code for the paper "MileBench: Benchmarking... | | Experimental |
| 27 | Open-Social-World/EgoNormia | EgoNormia: Benchmarking Physical Social Norm Understanding in VLMs | | Experimental |
| 28 | AUCOHL/RTL-Repo | RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects... | | Experimental |
| 29 | PKU-YuanGroup/Video-Bench | A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large... | | Experimental |
| 30 | SpiritsYouthHarmony/awesome-llm-physics-benchmarks | A curated list of benchmarks for evaluating LLMs on physics reasoning and... | | Experimental |
| 31 | mbeps/llama_3.1_italic_benchmark | Benchmarking Llama 3.1 models of various sizes on the ITALIC benchmark to... | | Experimental |
| 32 | mbeps/mistral_italic_benchmark | Benchmarking Mistral NeMo for Italian Cultural Alignment using the ITALIC benchmark | | Experimental |
| 33 | mbeps/magistral_italic_benchmark | Benchmarking the Magistral Small model on the ITALIC benchmark to evaluate its... | | Experimental |
| 34 | mbeps/qwen3-italic-benchmark | Benchmarking Qwen3 models of various sizes on the ITALIC benchmark to evaluate... | | Experimental |
| 35 | marcosgarciadata/llm-performance-benchmarker | Standardized benchmarking suite for evaluating Large Language Model latency,... | | Experimental |
| 36 | NLP-Final-Projects/citation-benchmark | A benchmark and evaluation pipeline for citation-aware text generation, with... | | Experimental |
| 37 | Shengwei-Peng/TOCFL-MultiBench | TOCFL-MultiBench: A multimodal benchmark for evaluating Chinese language... | | Experimental |
| 38 | dippatel1994/Large-Language-Models-Evaluation-Benchmarks-Collection | This repository contains a list of benchmarks used by big orgs to evaluate... | | Experimental |
| 39 | thejatingupta7/LLMCA | Large Language Models Acing Chartered Accountancy: Introduces CA‑Ben, a... | | Experimental |
| 40 | Thopterek/ChessBenchmark | Aleph Alpha and LEVEL3, LLM benchmark | | Experimental |
| 41 | mpuodziukas-labs/llm-cobol-benchmark | Systematic benchmark: top LLMs produce broken COBOL. 5 programs, 3 models,... | | Experimental |
| 42 | mohiuddinshahrukh/Shahrukh_clem_IM | A function induction game testing various LLMs with test functions and... | | Experimental |
| 43 | husayni/gsm-u | Novel benchmark for underspecified queries | | Experimental |
| 44 | doeunyy/pokerbench-slm-decision-making | Fine-tuning small language models (≤4B) for poker decision-making under... | | Experimental |
| 45 | JeroenVanGorsel/stock-bench | Stock Bench is an LLM benchmarking system where LLMs compete in a prediction... | | Experimental |
| 46 | alexteghipco/LLMReadabilityBenchmark | Human readability judgments as a benchmark for LLMs | | Experimental |
| 47 | g4ix/advLab1-HITS | Project for an advanced lab investigating LLM benchmarks from an IR... | | Experimental |