Evaluation Frameworks & Metrics for Transformer Models

This page tracks 44 evaluation-frameworks-and-metrics projects for transformer models. Three score 50 or above (the Established tier). The highest-rated is eth-sri/matharena at 52/100 with 229 stars.

Get all 44 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=evaluation-frameworks-metrics&limit=44"
```

The API is open to everyone at 100 requests/day with no key; a free key raises the limit to 1,000/day.
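
If you prefer to work with the data in code, here is a minimal Python sketch of the same request. The endpoint and query string come from the curl command above; the response shape (a JSON array of records with `name`, `score`, and `tier` fields) is an assumption based on the table below, so adjust the field names to whatever the API actually returns.

```python
import json
import urllib.request

# Same endpoint and query string as the curl command above.
API_URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=transformers&subcategory=evaluation-frameworks-metrics&limit=44"
)

def fetch_projects(url: str = API_URL) -> list:
    """Fetch the project list and decode the JSON payload."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

if __name__ == "__main__":
    projects = fetch_projects()
    # Assumed field names: "name", "score", "tier". Adjust to the real schema.
    for p in sorted(projects, key=lambda p: p.get("score", 0), reverse=True)[:5]:
        print(f"{p.get('name')}: {p.get('score')}/100 ({p.get('tier')})")
```

One call returns the whole list, so even the anonymous 100-requests/day tier is more than enough to keep a local copy fresh.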

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | eth-sri/matharena | Evaluation of LLMs on latest math competitions | 52 | Established |
| 2 | tatsu-lab/alpaca_eval | An automatic evaluator for instruction-following language models... | 51 | Established |
| 3 | HPAI-BSC/TuRTLe | TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025) | 50 | Established |
| 4 | nlp-uoregon/mlmm-evaluation | Multilingual Large Language Models Evaluation Benchmark | 42 | Emerging |
| 5 | haesleinhuepf/human-eval-bia | Benchmarking Large Language Models for Bio-Image Analysis Code Generation | 41 | Emerging |
| 6 | ShuntaroOkuma/adapt-gauge-core | Measure LLM adaptation efficiency: how fast models learn from few examples | 40 | Emerging |
| 7 | pyladiesams/eval-llm-based-apps-jan2025 | Create an evaluation framework for your LLM-based app. Incorporate it into... | 38 | Emerging |
| 8 | princeton-nlp/LLMBar | [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following | 37 | Emerging |
| 9 | chziakas/redeval | A library for red-teaming LLM applications with LLMs. | 37 | Emerging |
| 10 | JinjieNi/MixEval | The official evaluation suite and dynamic data release for MixEval. | 37 | Emerging |
| 11 | grigio/llm-eval-simple | llm-eval-simple is a simple LLM evaluation framework with intermediate... | 36 | Emerging |
| 12 | Praveengovianalytics/falcon-evaluate | Falcon Evaluate is an open-source Python library that aims to revolutionise the... | 36 | Emerging |
| 13 | franckalbinet/iomeval | Streamline evaluation evidence mapping at scale with LLMs | 33 | Emerging |
| 14 | GURPREETKAURJETHRA/LLMs-Evaluation | LLMs Evaluation | 33 | Emerging |
| 15 | TIGER-AI-Lab/TIGERScore | "TIGERScore: Towards Building Explainable Metric for All Text Generation... | 32 | Emerging |
| 16 | waltonfuture/Diff-eRank | [NeurIPS 2024] A Novel Rank-Based Metric for Evaluating Large Language Models | 31 | Emerging |
| 17 | telekom/llm_evaluation_results | LLM evaluation results | 31 | Emerging |
| 18 | brucewlee/nutcracker | Large Model Evaluation Experiments | 29 | Experimental |
| 19 | Praful932/llmsearch | Find better generation parameters for your LLM | 27 | Experimental |
| 20 | aws-samples/model-as-a-judge-eval | Notebooks for evaluating LLM-based applications using the Model (LLM) as a... | 26 | Experimental |
| 21 | alphadl/OOP-eval | The first Object-Oriented Programming (OOP) Evaluation Benchmark for LLMs | 25 | Experimental |
| 22 | open-compass/Ada-LEval | The official implementation of "Ada-LEval: Evaluating long-context LLMs with... | 25 | Experimental |
| 23 | horde-research/horde-common | Shared scripts for offline Kazakh LLM eval: run inference, auto-score, and... | 24 | Experimental |
| 24 | DigitalHarborFoundation/FlexEval | FlexEval is an LLM evaluation tool designed for practical quantitative analysis. | 24 | Experimental |
| 25 | jiayuww/SpatialEval | [NeurIPS'24] SpatialEval: a benchmark to evaluate spatial reasoning... | 23 | Experimental |
| 26 | GAIR-NLP/scaleeval | Scalable Meta-Evaluation of LLMs as Evaluators | 23 | Experimental |
| 27 | wahhyun/llm-eval | Evaluate large language models with tools for performance and consistency... | 22 | Experimental |
| 28 | AkhileshMalthi/llm-eval-framework | A production-grade framework for evaluating Large Language Model (LLM)... | 21 | Experimental |
| 29 | vjroy/routeeval | RouteEval: A benchmark for evaluating LLM tool calling in running route... | 21 | Experimental |
| 30 | UMass-Meta-LLM-Eval/llm_eval | A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup... | 21 | Experimental |
| 31 | McTosh1/modal-llm-evaluator | ⚡ Evaluate LLM prompts at scale with fast, parallel execution, real-time... | 21 | Experimental |
| 32 | giuliano-t/llm-financial-regulatory-auditor | A structured evaluation pipeline for LLM-generated outputs in financial... | 18 | Experimental |
| 33 | user1342/conjecture | Evaluating the likelihood of data points in an LLM's training set | 18 | Experimental |
| 34 | franciellevargas/MFTCXplain | MFTCXplain is the first multilingual benchmark dataset designed to evaluate... | 17 | Experimental |
| 35 | SouravD-Me/LLM-Evaluation-Dashboard | A Visual Dashboard for Fundamental Benchmarking of LLMs | 17 | Experimental |
| 36 | sanand0/llmmath | How good are LLMs at mental math? An evaluation across 50 models from... | 16 | Experimental |
| 37 | david-xander/measuring-llm-knowledge | How much does an LLM know about my programming language? | 15 | Experimental |
| 38 | serhiismetanskyi/llm-output-evaluation-with-deepeval | DeepEval LLM quality evaluation tests with LLM-as-a-judge | 14 | Experimental |
| 39 | aims-foundations/reeval | Reliable and Efficient Model-based Generative Model Evaluation | 14 | Experimental |
| 40 | dustalov/llmfao | Large Language Model Feedback Analysis and Optimization (LLMFAO) | 14 | Experimental |
| 41 | 2pa4ul2/Easygen-v2 | Exam Generation With Large Language Models (LLMs) | 13 | Experimental |
| 42 | lfoppiano/mining-llm-evaluation-paper | Source of the article "Mining experimental data from Materials Science... | 12 | Experimental |
| 43 | ParthaPRay/llm_evaluation_metrics_localized | This repo contains code for localized LLM evaluation metrics via a framework... | 11 | Experimental |
| 44 | phbillet/LLMs-basic_linear_algebra | Basic testing of the arithmetic and linear algebra skills of two LLMs (Bard... | 11 | Experimental |
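
The tier cutoffs are not documented on this page, but the table is consistent with simple score thresholds: Established at 50 and above, Emerging at roughly 30 to 49, and Experimental below that. A minimal sketch of that inferred bucketing follows; the exact boundaries are an assumption, not a documented part of the scoring system.

```python
def tier(score: int) -> str:
    """Map a 0-100 quality score to a tier.

    Thresholds are inferred from the table above (50 -> Established,
    31 -> Emerging, 29 -> Experimental); they are an assumption.
    """
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"

# Spot-check against rows from the table above.
assert tier(52) == "Established"   # eth-sri/matharena
assert tier(42) == "Emerging"      # nlp-uoregon/mlmm-evaluation
assert tier(29) == "Experimental"  # brucewlee/nutcracker
```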