Domain-Specific Benchmarks (Transformer Models)

There are 47 domain-specific benchmark projects tracked for transformer models. Three score above 50 (the Established tier). The highest-rated is stanfordnlp/axbench at 54/100 with 175 stars.

Get the 47 tracked projects as JSON (the example below requests the first 20 via limit=20):

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=domain-specific-benchmarks&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
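
For programmatic access, here is a minimal Python sketch of the same request. The endpoint and query string are copied from the curl example above; the response is simply dumped as-is, since its exact schema is not documented on this page.

# Minimal sketch of the request shown above, standard library only.
# The response body is printed verbatim because its schema is not documented here.
import json
import urllib.request

URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=transformers&subcategory=domain-specific-benchmarks&limit=20"
)

with urllib.request.urlopen(URL, timeout=30) as resp:
    payload = json.load(resp)

print(json.dumps(payload, indent=2))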

Rank. Model (score, tier): description

1. stanfordnlp/axbench (54, Established): Stanford NLP Python library for benchmarking the utility of LLM...
2. aidatatools/ollama-benchmark (53, Established): LLM Benchmark for Throughput via Ollama (Local LLMs)
3. LarHope/ollama-benchmark (53, Established): Ollama based Benchmark with detail I/O token per second. Python with...
4. qcri/LLMeBench (47, Emerging): Benchmarking Large Language Models
5. THUDM/LongBench (45, Emerging): LongBench v2 and LongBench (ACL 25'&24')
6. microsoft/LLF-Bench (45, Emerging): A benchmark for evaluating learning agents based on just language feedback
7. YJiangcm/FollowBench (45, Emerging): [ACL 2024] FollowBench: A Multi-level Fine-grained Constraints Following...
8. The-FinAI/CALM (41, Emerging): A LLM training and evaluation benchmark for credit scoring
9. RedHatResearch/conext24-NetConfEval (40, Emerging): Benchmark for evaluating LLMs in network configuration problems.
10. OpenBMB/InfiniteBench (40, Emerging): Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K...
11. cloudmercato/ollama-benchmark (39, Emerging): Handy tool to measure the performance and efficiency of LLMs workloads.
12. rohit901/VANE-Bench (38, Emerging): [NAACL'25] Contains code and documentation for our VANE-Bench paper.
13. AnkitNayak-eth/llmBench (38, Emerging): llmBench is a high-depth benchmarking tool designed to measure the raw...
14. epfml/llm-optimizer-benchmark (38, Emerging): Benchmarking Optimizers for LLM Pretraining
15. AIFEG/BenchLMM (35, Emerging): [ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large...
16. zhchen18/ToMBench (35, Emerging): ToMBench: Benchmarking Theory of Mind in Large Language Models, ACL 2024.
17. ApplyU-ai/ColorBlindnessEval (33, Emerging): ColorBlindnessEval: Can Vision Language Models Pass Color Blindness Tests?
18. SORRY-Bench/sorry-bench (33, Emerging): Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large...
19. SapienzaNLP/ita-bench (32, Emerging): A collection of Italian benchmarks for LLM evaluation
20. deep-symbolic-mathematics/llm-srbench (32, Emerging): [ICML2025 Oral] LLM-SRBench: A New Benchmark for Scientific Equation...
21. HiThink-Research/BizFinBench (32, Emerging): A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
22. RaptorMai/MLLM-CompBench (31, Emerging): [NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs...
23. EternityYW/RUPBench (31, Emerging): RUPBench: Benchmarking Reasoning Under Perturbations for Robustness...
24. EternityYW/TRAM-Benchmark (30, Emerging): TRAM: Benchmarking Temporal Reasoning for Large Language Models (Findings of...
25. zchuz/TimeBench (29, Experimental): The repository for ACL 2024 paper "TimeBench: A Comprehensive Evaluation of...
26. MileBench/MileBench (29, Experimental): This repo contains evaluation code for the paper "MileBench: Benchmarking...
27. Open-Social-World/EgoNormia (29, Experimental): EgoNormia | Benchmarking Physical Social Norm Understanding in VLMs
28. AUCOHL/RTL-Repo (29, Experimental): RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects...
29. PKU-YuanGroup/Video-Bench (23, Experimental): A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large...
30. SpiritsYouthHarmony/awesome-llm-physics-benchmarks (23, Experimental): A curated list of benchmarks for evaluating LLMs on physics reasoning and...
31. mbeps/llama_3.1_italic_benchmark (22, Experimental): Benchmarking Llama 3.1 models of various sizes on the ITALIC benchmark to...
32. mbeps/mistral_italic_benchmark (22, Experimental): Benchmarking Mistral NeMo for Italian Cultural Alignment using ITALIC benchmark
33. mbeps/magistral_italic_benchmark (22, Experimental): Benchmarking Magistral Small model on the ITALIC benchmark to evaluate their...
34. mbeps/qwen3-italic-benchmark (22, Experimental): Benchmarking Qwen3 models of various sizes on the ITALIC benchmark to evaluate...
35. marcosgarciadata/llm-performance-benchmarker (22, Experimental): Standardized benchmarking suite for evaluating Large Language Model latency,...
36. NLP-Final-Projects/citation-benchmark (21, Experimental): A benchmark and evaluation pipeline for citation-aware text generation, with...
37. Shengwei-Peng/TOCFL-MultiBench (20, Experimental): TOCFL-MultiBench: A multimodal benchmark for evaluating Chinese language...
38. dippatel1994/Large-Language-Models-Evaluation-Benchmarks-Collection (19, Experimental): This repository contains a list of benchmarks used by big orgs to evaluate...
39. thejatingupta7/LLMCA (19, Experimental): 🤖 Large Language Models Acing Chartered Accountancy: Introduces CA‑Ben 📈, a...
40. Thopterek/ChessBenchmark (15, Experimental): Aleph Alpha and LEVEL3, LLM benchmark
41. mpuodziukas-labs/llm-cobol-benchmark (14, Experimental): Systematic benchmark: top LLMs produce broken COBOL. 5 programs, 3 models,...
42. mohiuddinshahrukh/Shahrukh_clem_IM (13, Experimental): A function induction game testing various LLMs with test functions and...
43. husayni/gsm-u (13, Experimental): Novel benchmark for underspecified queries
44. doeunyy/pokerbench-slm-decision-making (13, Experimental): Fine-tuning small language models (≤4B) for poker decision-making under...
45. JeroenVanGorsel/stock-bench (11, Experimental): Stock Bench is an LLM benchmarking system where LLMs compete in a prediction...
46. alexteghipco/LLMReadabilityBenchmark (11, Experimental): Human readability judgments as a benchmark for LLMs
47. g4ix/advLab1-HITS (10, Experimental): Project for an advanced lab investigating LLM benchmarks from an IR...
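
The tier labels above follow the score bands visible in the listing. A small sketch of that bucketing is below; the thresholds (above 50 Established, 30 to 50 Emerging, below 30 Experimental) are inferred from the listed rows, not from an official definition.

# Tier bucketing inferred from the scores in the table above.
# Thresholds are an assumption read off the listed rows, not an official spec.
def tier(score: int) -> str:
    if score > 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"

# Spot checks against the table.
assert tier(54) == "Established"   # stanfordnlp/axbench
assert tier(30) == "Emerging"      # EternityYW/TRAM-Benchmark
assert tier(29) == "Experimental"  # zchuz/TimeBench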