Evaluation Frameworks & Metrics: Transformer Models
44 evaluation frameworks and metrics projects are tracked. Three score above 50, placing them in the Established tier. The highest-rated is eth-sri/matharena at 52/100, with 229 stars.
Get the projects as JSON (the query below returns 20; raise `limit` to 44 for the full list):

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=evaluation-frameworks-metrics&limit=20"
```
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
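For programmatic use, here is a minimal Python sketch using only the standard library. The response shape and the field names `name`, `score`, and `tier` are assumptions for illustration, not documented API fields; inspect the actual payload and adapt.

```python
# Minimal sketch: fetch the dataset from the public endpoint above.
# ASSUMPTION: the response is JSON, either a list of project records or an
# object with a "projects" list; record field names below are guesses.
import json
import urllib.request

URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=transformers&subcategory=evaluation-frameworks-metrics&limit=44"
)

with urllib.request.urlopen(URL, timeout=30) as resp:
    data = json.load(resp)

# Normalize the two plausible shapes, then print one line per project.
projects = data if isinstance(data, list) else data.get("projects", [])
for p in projects:
    print(p.get("name"), p.get("score"), p.get("tier"))
```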
| # | Model | Description | Tier |
|---|---|---|---|
| 1 | eth-sri/matharena | Evaluation of LLMs on latest math competitions | Established |
| 2 | tatsu-lab/alpaca_eval | An automatic evaluator for instruction-following language models... | Established |
| 3 | HPAI-BSC/TuRTLe | TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025) | Established |
| 4 | nlp-uoregon/mlmm-evaluation | Multilingual Large Language Models Evaluation Benchmark | Emerging |
| 5 | haesleinhuepf/human-eval-bia | Benchmarking Large Language Models for Bio-Image Analysis Code Generation | Emerging |
| 6 | ShuntaroOkuma/adapt-gauge-core | Measure LLM adaptation efficiency: how fast models learn from few examples | Emerging |
| 7 | pyladiesams/eval-llm-based-apps-jan2025 | Create an evaluation framework for your LLM-based app. Incorporate it into... | Emerging |
| 8 | princeton-nlp/LLMBar | [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following | Emerging |
| 9 | chziakas/redeval | A library for red-teaming LLM applications with LLMs. | Emerging |
| 10 | JinjieNi/MixEval | The official evaluation suite and dynamic data release for MixEval. | Emerging |
| 11 | grigio/llm-eval-simple | llm-eval-simple is a simple LLM evaluation framework with intermediate... | Emerging |
| 12 | Praveengovianalytics/falcon-evaluate | Falcon Evaluate is an open-source Python library that aims to revolutionise the... | Emerging |
| 13 | franckalbinet/iomeval | Streamline evaluation evidence mapping at scale with LLMs | Emerging |
| 14 | GURPREETKAURJETHRA/LLMs-Evaluation | LLMs Evaluation | Emerging |
| 15 | TIGER-AI-Lab/TIGERScore | TIGERScore: Towards Building Explainable Metric for All Text Generation... | Emerging |
| 16 | waltonfuture/Diff-eRank | [NeurIPS 2024] A Novel Rank-Based Metric for Evaluating Large Language Models | Emerging |
| 17 | telekom/llm_evaluation_results | LLM evaluation results | Emerging |
| 18 | brucewlee/nutcracker | Large Model Evaluation Experiments | Experimental |
| 19 | Praful932/llmsearch | Find better generation parameters for your LLM | Experimental |
| 20 | aws-samples/model-as-a-judge-eval | Notebooks for evaluating LLM-based applications using the Model (LLM) as a... | Experimental |
| 21 | alphadl/OOP-eval | The first Object-Oriented Programming (OOP) Evaluation Benchmark for LLMs | Experimental |
| 22 | open-compass/Ada-LEval | The official implementation of "Ada-LEval: Evaluating long-context LLMs with... | Experimental |
| 23 | horde-research/horde-common | Shared scripts for offline Kazakh LLM eval: run inference, auto-score, and... | Experimental |
| 24 | DigitalHarborFoundation/FlexEval | FlexEval is an LLM evaluation tool designed for practical quantitative analysis. | Experimental |
| 25 | jiayuww/SpatialEval | [NeurIPS'24] SpatialEval: a benchmark to evaluate spatial reasoning... | Experimental |
| 26 | GAIR-NLP/scaleeval | Scalable Meta-Evaluation of LLMs as Evaluators | Experimental |
| 27 | wahhyun/llm-eval | Evaluate large language models with tools for performance and consistency... | Experimental |
| 28 | AkhileshMalthi/llm-eval-framework | A production-grade framework for evaluating Large Language Model (LLM)... | Experimental |
| 29 | vjroy/routeeval | RouteEval: A benchmark for evaluating LLM tool calling in running route... | Experimental |
| 30 | UMass-Meta-LLM-Eval/llm_eval | A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup... | Experimental |
| 31 | McTosh1/modal-llm-evaluator | ⚡ Evaluate LLM prompts at scale with fast, parallel execution, real-time... | Experimental |
| 32 | giuliano-t/llm-financial-regulatory-auditor | A structured evaluation pipeline for LLM-generated outputs in financial... | Experimental |
| 33 | user1342/conjecture | Evaluating the likelihood of data points in an LLM's training set | Experimental |
| 34 | franciellevargas/MFTCXplain | MFTCXplain is the first multilingual benchmark dataset designed to evaluate... | Experimental |
| 35 | SouravD-Me/LLM-Evaluation-Dashboard | A Visual Dashboard for Fundamental Benchmarking of LLMs | Experimental |
| 36 | sanand0/llmmath | How good are LLMs at mental math? An evaluation across 50 models from... | Experimental |
| 37 | david-xander/measuring-llm-knowledge | How much does an LLM know about my programming language? | Experimental |
| 38 | serhiismetanskyi/llm-output-evaluation-with-deepeval | DeepEval LLM quality evaluation tests with LLM-as-a-judge | Experimental |
| 39 | aims-foundations/reeval | Reliable and Efficient Model-based Generative Model Evaluation | Experimental |
| 40 | dustalov/llmfao | Large Language Model Feedback Analysis and Optimization (LLMFAO) | Experimental |
| 41 | 2pa4ul2/Easygen-v2 | Exam Generation With Large Language Models (LLMs) | Experimental |
| 42 | lfoppiano/mining-llm-evaluation-paper | Source of the article "Mining experimental data from Materials Science... | Experimental |
| 43 | ParthaPRay/llm_evaluation_metrics_localized | This repo contains code for localized LLM evaluation metrics via a framework... | Experimental |
| 44 | phbillet/LLMs-basic_linear_algebra | Basic testing of the arithmetic and linear algebra skills of two LLMs (Bard... | Experimental |