Evaluation Frameworks & Metrics for Transformer Models

This page tracks 44 evaluation-frameworks-and-metrics projects for transformer models. Three score 50 or above (the Established tier). The highest-rated is eth-sri/matharena at 52/100 with 229 stars.

Get all 44 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=evaluation-frameworks-metrics&limit=44"
```

The API is open to everyone at 100 requests/day with no key; a free key raises the limit to 1,000/day.
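
If you prefer to work with the data in code, here is a minimal Python sketch of the same request. The endpoint and query string come from the curl command above; the response shape (a JSON array of records with `name`, `score`, and `tier` fields) is an assumption based on the table below, so adjust the field names to whatever the API actually returns.

```python
import json
import urllib.request

# Same endpoint and query string as the curl command above.
API_URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=transformers&subcategory=evaluation-frameworks-metrics&limit=44"
)

def fetch_projects(url: str = API_URL) -> list:
    """Fetch the project list and decode the JSON payload."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

if __name__ == "__main__":
    projects = fetch_projects()
    # Assumed field names: "name", "score", "tier". Adjust to the real schema.
    for p in sorted(projects, key=lambda p: p.get("score", 0), reverse=True)[:5]:
        print(f"{p.get('name')}: {p.get('score')}/100 ({p.get('tier')})")
```

One call returns the whole list, so even the anonymous 100-requests/day tier is more than enough to keep a local copy fresh.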

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | eth-sri/matharena | Evaluation of LLMs on latest math competitions | 52 | Established |
| 2 | tatsu-lab/alpaca_eval | An automatic evaluator for instruction-following language models... | 51 | Established |
| 3 | HPAI-BSC/TuRTLe | TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025) | 50 | Established |
| 4 | nlp-uoregon/mlmm-evaluation | Multilingual Large Language Models Evaluation Benchmark | 42 | Emerging |
| 5 | haesleinhuepf/human-eval-bia | Benchmarking Large Language Models for Bio-Image Analysis Code Generation | 41 | Emerging |
| 6 | ShuntaroOkuma/adapt-gauge-core | Measure LLM adaptation efficiency: how fast models learn from few examples | 40 | Emerging |
| 7 | pyladiesams/eval-llm-based-apps-jan2025 | Create an evaluation framework for your LLM-based app. Incorporate it into... | 38 | Emerging |
| 8 | princeton-nlp/LLMBar | [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following | 37 | Emerging |
| 9 | chziakas/redeval | A library for red-teaming LLM applications with LLMs. | 37 | Emerging |
| 10 | JinjieNi/MixEval | The official evaluation suite and dynamic data release for MixEval. | 37 | Emerging |
| 11 | grigio/llm-eval-simple | llm-eval-simple is a simple LLM evaluation framework with intermediate... | 36 | Emerging |
| 12 | Praveengovianalytics/falcon-evaluate | Falcon Evaluate is an open-source Python library that aims to revolutionise the... | 36 | Emerging |
| 13 | franckalbinet/iomeval | Streamline evaluation evidence mapping at scale with LLMs | 33 | Emerging |
| 14 | GURPREETKAURJETHRA/LLMs-Evaluation | LLMs Evaluation | 33 | Emerging |
| 15 | TIGER-AI-Lab/TIGERScore | "TIGERScore: Towards Building Explainable Metric for All Text Generation... | 32 | Emerging |
| 16 | waltonfuture/Diff-eRank | [NeurIPS 2024] A Novel Rank-Based Metric for Evaluating Large Language Models | 31 | Emerging |
| 17 | telekom/llm_evaluation_results | LLM evaluation results | 31 | Emerging |
| 18 | brucewlee/nutcracker | Large Model Evaluation Experiments | 29 | Experimental |
| 19 | Praful932/llmsearch | Find better generation parameters for your LLM | 27 | Experimental |
| 20 | aws-samples/model-as-a-judge-eval | Notebooks for evaluating LLM-based applications using the Model (LLM) as a... | 26 | Experimental |
| 21 | alphadl/OOP-eval | The first Object-Oriented Programming (OOP) Evaluation Benchmark for LLMs | 25 | Experimental |
| 22 | open-compass/Ada-LEval | The official implementation of "Ada-LEval: Evaluating long-context LLMs with... | 25 | Experimental |
| 23 | horde-research/horde-common | Shared scripts for offline Kazakh LLM eval: run inference, auto-score, and... | 24 | Experimental |
| 24 | DigitalHarborFoundation/FlexEval | FlexEval is an LLM evaluation tool designed for practical quantitative analysis. | 24 | Experimental |
| 25 | jiayuww/SpatialEval | [NeurIPS'24] SpatialEval: a benchmark to evaluate spatial reasoning... | 23 | Experimental |
| 26 | GAIR-NLP/scaleeval | Scalable Meta-Evaluation of LLMs as Evaluators | 23 | Experimental |
| 27 | wahhyun/llm-eval | Evaluate large language models with tools for performance and consistency... | 22 | Experimental |
| 28 | AkhileshMalthi/llm-eval-framework | A production-grade framework for evaluating Large Language Model (LLM)... | 21 | Experimental |
| 29 | vjroy/routeeval | RouteEval: A benchmark for evaluating LLM tool calling in running route... | 21 | Experimental |
| 30 | UMass-Meta-LLM-Eval/llm_eval | A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup... | 21 | Experimental |
| 31 | McTosh1/modal-llm-evaluator | ⚡ Evaluate LLM prompts at scale with fast, parallel execution, real-time... | 21 | Experimental |
| 32 | giuliano-t/llm-financial-regulatory-auditor | A structured evaluation pipeline for LLM-generated outputs in financial... | 18 | Experimental |
| 33 | user1342/conjecture | Evaluating the likelihood of data points in an LLM's training set | 18 | Experimental |
| 34 | franciellevargas/MFTCXplain | MFTCXplain is the first multilingual benchmark dataset designed to evaluate... | 17 | Experimental |
| 35 | SouravD-Me/LLM-Evaluation-Dashboard | A Visual Dashboard for Fundamental Benchmarking of LLMs | 17 | Experimental |
| 36 | sanand0/llmmath | How good are LLMs at mental math? An evaluation across 50 models from... | 16 | Experimental |
| 37 | david-xander/measuring-llm-knowledge | How much does an LLM know about my programming language? | 15 | Experimental |
| 38 | serhiismetanskyi/llm-output-evaluation-with-deepeval | DeepEval LLM quality evaluation tests with LLM-as-a-judge | 14 | Experimental |
| 39 | aims-foundations/reeval | Reliable and Efficient Model-based Generative Model Evaluation | 14 | Experimental |
| 40 | dustalov/llmfao | Large Language Model Feedback Analysis and Optimization (LLMFAO) | 14 | Experimental |
| 41 | 2pa4ul2/Easygen-v2 | Exam Generation With Large Language Models (LLMs) | 13 | Experimental |
| 42 | lfoppiano/mining-llm-evaluation-paper | Source of the article "Mining experimental data from Materials Science... | 12 | Experimental |
| 43 | ParthaPRay/llm_evaluation_metrics_localized | This repo contains code for localized LLM evaluation metrics via a framework... | 11 | Experimental |
| 44 | phbillet/LLMs-basic_linear_algebra | Basic testing of the arithmetic and linear algebra skills of two LLMs (Bard... | 11 | Experimental |
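
The tier cutoffs are not documented on this page, but the table is consistent with simple score thresholds: Established at 50 and above, Emerging at roughly 30 to 49, and Experimental below that. A minimal sketch of that inferred bucketing follows; the exact boundaries are an assumption, not a documented part of the scoring system.

```python
def tier(score: int) -> str:
    """Map a 0-100 quality score to a tier.

    Thresholds are inferred from the table above (50 -> Established,
    31 -> Emerging, 29 -> Experimental); they are an assumption.
    """
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"

# Spot-check against rows from the table above.
assert tier(52) == "Established"   # eth-sri/matharena
assert tier(42) == "Emerging"      # nlp-uoregon/mlmm-evaluation
assert tier(29) == "Experimental"  # brucewlee/nutcracker
```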