Domain-Specific Benchmarks: Transformer Models
47 domain-specific benchmark projects are tracked. 3 score above 50 (the Established tier). The highest-rated is stanfordnlp/axbench at 54/100, with 175 stars.
Fetch the projects as JSON (note that `limit=20` caps this request at the first 20 of the 47):

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=domain-specific-benchmarks&limit=20"
```
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
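The JSON response can be filtered client-side. The sketch below assumes a response shape of `{"data": [{"name", "score", "tier"}, ...]}` inferred from the table columns; the actual field names in the API payload may differ, so it runs against an inline sample rather than a live request:

```python
import json

# Hypothetical payload shape inferred from the table columns (Model / Score / Tier).
# Only stanfordnlp/axbench's score (54) is known from the summary above; the
# other score is left null rather than invented.
sample_response = json.loads("""
{
  "data": [
    {"name": "stanfordnlp/axbench", "score": 54, "tier": "Established"},
    {"name": "qcri/LLMeBench", "score": null, "tier": "Emerging"}
  ]
}
""")

def by_tier(payload, tier):
    """Return the names of projects in the given quality tier."""
    return [p["name"] for p in payload["data"] if p["tier"] == tier]

print(by_tier(sample_response, "Established"))  # ['stanfordnlp/axbench']
```

Replace `sample_response` with `json.load()` over the live endpoint once you have confirmed the real schema.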
| # | Model | Description | Score | Tier |
|---|---|---|---|---|
| 1 | stanfordnlp/axbench | Stanford NLP Python library for benchmarking the utility of LLM... | 54 | Established |
| 2 | aidatatools/ollama-benchmark | LLM Benchmark for Throughput via Ollama (Local LLMs) | | Established |
| 3 | LarHope/ollama-benchmark | Ollama based Benchmark with detail I/O token per second. Python with... | | Established |
| 4 | qcri/LLMeBench | Benchmarking Large Language Models | | Emerging |
| 5 | THUDM/LongBench | LongBench v2 and LongBench (ACL 25'&24') | | Emerging |
| 6 | microsoft/LLF-Bench | A benchmark for evaluating learning agents based on just language feedback | | Emerging |
| 7 | YJiangcm/FollowBench | [ACL 2024] FollowBench: A Multi-level Fine-grained Constraints Following... | | Emerging |
| 8 | The-FinAI/CALM | An LLM training and evaluation benchmark for credit scoring | | Emerging |
| 9 | RedHatResearch/conext24-NetConfEval | Benchmark for evaluating LLMs in network configuration problems. | | Emerging |
| 10 | OpenBMB/InfiniteBench | Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K... | | Emerging |
| 11 | cloudmercato/ollama-benchmark | Handy tool to measure the performance and efficiency of LLM workloads. | | Emerging |
| 12 | rohit901/VANE-Bench | [NAACL'25] Contains code and documentation for our VANE-Bench paper. | | Emerging |
| 13 | AnkitNayak-eth/llmBench | llmBench is a high-depth benchmarking tool designed to measure the raw... | | Emerging |
| 14 | epfml/llm-optimizer-benchmark | Benchmarking Optimizers for LLM Pretraining | | Emerging |
| 15 | AIFEG/BenchLMM | [ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large... | | Emerging |
| 16 | zhchen18/ToMBench | ToMBench: Benchmarking Theory of Mind in Large Language Models, ACL 2024. | | Emerging |
| 17 | ApplyU-ai/ColorBlindnessEval | ColorBlindnessEval: Can Vision Language Models Pass Color Blindness Tests? | | Emerging |
| 18 | SORRY-Bench/sorry-bench | Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large... | | Emerging |
| 19 | SapienzaNLP/ita-bench | A collection of Italian benchmarks for LLM evaluation | | Emerging |
| 20 | deep-symbolic-mathematics/llm-srbench | [ICML2025 Oral] LLM-SRBench: A New Benchmark for Scientific Equation... | | Emerging |
| 21 | HiThink-Research/BizFinBench | A Business-Driven Real-World Financial Benchmark for Evaluating LLMs | | Emerging |
| 22 | RaptorMai/MLLM-CompBench | [NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs... | | Emerging |
| 23 | EternityYW/RUPBench | RUPBench: Benchmarking Reasoning Under Perturbations for Robustness... | | Emerging |
| 24 | EternityYW/TRAM-Benchmark | TRAM: Benchmarking Temporal Reasoning for Large Language Models (Findings of... | | Emerging |
| 25 | zchuz/TimeBench | The repository for the ACL 2024 paper "TimeBench: A Comprehensive Evaluation of... | | Experimental |
| 26 | MileBench/MileBench | This repo contains evaluation code for the paper "MileBench: Benchmarking... | | Experimental |
| 27 | Open-Social-World/EgoNormia | EgoNormia: Benchmarking Physical Social Norm Understanding in VLMs | | Experimental |
| 28 | AUCOHL/RTL-Repo | RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects... | | Experimental |
| 29 | PKU-YuanGroup/Video-Bench | A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large... | | Experimental |
| 30 | SpiritsYouthHarmony/awesome-llm-physics-benchmarks | A curated list of benchmarks for evaluating LLMs on physics reasoning and... | | Experimental |
| 31 | mbeps/llama_3.1_italic_benchmark | Benchmarking Llama 3.1 models of various sizes on the ITALIC benchmark to... | | Experimental |
| 32 | mbeps/mistral_italic_benchmark | Benchmarking Mistral NeMo for Italian Cultural Alignment using the ITALIC benchmark | | Experimental |
| 33 | mbeps/magistral_italic_benchmark | Benchmarking the Magistral Small model on the ITALIC benchmark to evaluate its... | | Experimental |
| 34 | mbeps/qwen3-italic-benchmark | Benchmarking Qwen3 models of various sizes on the ITALIC benchmark to evaluate... | | Experimental |
| 35 | marcosgarciadata/llm-performance-benchmarker | Standardized benchmarking suite for evaluating Large Language Model latency,... | | Experimental |
| 36 | NLP-Final-Projects/citation-benchmark | A benchmark and evaluation pipeline for citation-aware text generation, with... | | Experimental |
| 37 | Shengwei-Peng/TOCFL-MultiBench | TOCFL-MultiBench: A multimodal benchmark for evaluating Chinese language... | | Experimental |
| 38 | dippatel1994/Large-Language-Models-Evaluation-Benchmarks-Collection | This repository contains a list of benchmarks used by big orgs to evaluate... | | Experimental |
| 39 | thejatingupta7/LLMCA | Large Language Models Acing Chartered Accountancy: Introduces CA‑Ben, a... | | Experimental |
| 40 | Thopterek/ChessBenchmark | Aleph Alpha and LEVEL3, LLM benchmark | | Experimental |
| 41 | mpuodziukas-labs/llm-cobol-benchmark | Systematic benchmark: top LLMs produce broken COBOL. 5 programs, 3 models,... | | Experimental |
| 42 | mohiuddinshahrukh/Shahrukh_clem_IM | A function induction game testing various LLMs with test functions and... | | Experimental |
| 43 | husayni/gsm-u | Novel benchmark for underspecified queries | | Experimental |
| 44 | doeunyy/pokerbench-slm-decision-making | Fine-tuning small language models (≤4B) for poker decision-making under... | | Experimental |
| 45 | JeroenVanGorsel/stock-bench | Stock Bench is an LLM benchmarking system where LLMs compete in a prediction... | | Experimental |
| 46 | alexteghipco/LLMReadabilityBenchmark | Human readability judgments as a benchmark for LLMs | | Experimental |
| 47 | g4ix/advLab1-HITS | Project for an advanced lab investigating LLM benchmarks from an IR... | | Experimental |