LLM Evaluation & Benchmarking: ML Frameworks

Frameworks, platforms, and benchmarks for systematically evaluating and comparing LLM performance across metrics like accuracy, safety, reliability, and cost. Does NOT include general LLM applications, deployment tools, or inference optimization.

There are 41 LLM evaluation and benchmarking frameworks tracked; 1 scores 70 or above (Verified tier). The highest-rated is Cloud-CV/EvalAI at 70/100 with 2,013 stars.

Get all 41 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=llm-evaluation-benchmarking&limit=20"
```

The API is open to everyone: 100 requests/day with no key needed, or get a free key for 1,000 requests/day.
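The same query can be issued from Python. A minimal sketch using only the standard library; it builds the request URL shown above (the response schema is not documented here, so the fetch step is left as a comment):

```python
from urllib.parse import urlencode

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def quality_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Build the dataset-quality query URL for the public API."""
    params = {"domain": domain, "subcategory": subcategory, "limit": limit}
    return f"{BASE}?{urlencode(params)}"

url = quality_url("ml-frameworks", "llm-evaluation-benchmarking")
# To actually fetch (network access required):
#   import json, urllib.request
#   data = json.load(urllib.request.urlopen(url))
```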

| # | Framework | Description | Score | Tier |
|---|-----------|-------------|-------|------|
| 1 | Cloud-CV/EvalAI | :cloud: :rocket: :bar_chart: :chart_with_upwards_trend: Evaluating state of... | 70 | Verified |
| 2 | fireindark707/Python-Schema-Matching | A python tool using XGboost and sentence-transformers to perform schema... | 60 | Established |
| 3 | graphbookai/graphbook | Visual AI development framework for training and inference of ML models,... | 57 | Established |
| 4 | visual-layer/fastdup | fastdup is a powerful, free tool designed to rapidly generate valuable... | 51 | Established |
| 5 | github/CodeSearchNet | Datasets, tools, and benchmarks for representation learning of code. | 49 | Emerging |
| 6 | tthtlc/awesome-source-analysis | Source code understanding via Machine Learning techniques | 47 | Emerging |
| 7 | Xenios91/Glyph | An architecture independent binary analysis tool for fingerprinting... | 46 | Emerging |
| 8 | RAILethicsHub/rail-score | Python SDK | 44 | Emerging |
| 9 | Striveworks/valor | Valor is a lightweight, numpy-based library designed for fast and seamless... | 43 | Emerging |
| 10 | paceval/paceval | paceval is a high-performance mathematical runtime for deterministic AI and... | 43 | Emerging |
| 11 | RoboticsData/score_lerobot_episodes | A lightweight toolkit for quantitatively scoring LeRobot episodes. | 42 | Emerging |
| 12 | greynewell/evaldriven.org | Ship evals before you ship features. | 42 | Emerging |
| 13 | josh-ashkinaze/plurals | Plurals: A System for Guiding LLMs Via Simulated Social Ensembles | 40 | Emerging |
| 14 | AIT-Protocol/einstein-ait-prod | Supercharge Bittensor Ecosystem with Advanced Mathematical and Logical AI | 36 | Emerging |
| 15 | khoj-ai/llm-coup | Let LLMs play coup with each other and see who's the best at deception & strategy | 36 | Emerging |
| 16 | GustyCube/ERR-EVAL | Benchmark for evaluating AI epistemic reliability - testing how well LLMs... | 32 | Emerging |
| 17 | Fir121/llm-classifier | Structured LLM based classification, clustering and extraction framework... | 32 | Emerging |
| 18 | lac-dcc/yali | A framework to analyze a space formed by the combination of program... | 31 | Emerging |
| 19 | metriccoders/ml-models | This is the Metric Coders Model Hub that contains the fastest growing tiny... | 30 | Emerging |
| 20 | Vatshayan/Data-Duplication-Removal-using-Machine-Learning | Final Year Project as Deletion of Duplicated data using Machine learning... | 29 | Experimental |
| 21 | wkdhkr/dedupper | import various files, detect duplicates with sqlite, reject image file by... | 28 | Experimental |
| 22 | ppashakhanloo/CodeTrek | A powerful relational representation of source code | 28 | Experimental |
| 23 | MPX0222/BroadLearningSystem-APIs-1.0 | Modification for Broad Learning System, including BLS, CNN-BLS, PCA-BLS. Now... | 24 | Experimental |
| 24 | lof310/arch_eval | arch_eval is a high-level library for efficient architecture evaluation of... | 24 | Experimental |
| 25 | semantic-parsing/semantic-parsing.github.io | Website for "A Survey of Modeling and Data resources for Semantic Parsing" | 24 | Experimental |
| 26 | ztsalexey/epoch-bench | EPOCH: Evaluating Progress Origins in Causal History — LLM benchmark for... | 23 | Experimental |
| 27 | WINSTON672/lin-score | The Lin (𝓛) — a fundamental unit of AI cognitive efficiency. Like miles per... | 22 | Experimental |
| 28 | MukundaKatta/CacheLLM | Semantic caching for LLM responses — n-gram similarity matching, SQLite... | 22 | Experimental |
| 29 | theMethodolojeeOrg/SkynetBench | A rigorous methodology for detecting authority pressure's effect on AI... | 22 | Experimental |
| 30 | MukundaKatta/LLMProxy | Unified API proxy for LLM providers — OpenAI, Anthropic with fallback... | 22 | Experimental |
| 31 | cr7yash/EvalForge | LLM evaluation platform with 13+ metrics across accuracy, performance, and... | 21 | Experimental |
| 32 | jubaedemon/LBBS-Standard | 💰 Establish a standard for LLM billing and benchmarking to enable fair... | 21 | Experimental |
| 33 | Fardeen37/Data-Duplication-Remover-ML | A powerful machine learning based tool for detecting, analyzing, and... | 20 | Experimental |
| 34 | voxel51/reconstruction-error-ratios | Estimate dataset difficulty and detect label mistakes using reconstruction... | 15 | Experimental |
| 35 | jerarddxb-ops/excuse-evaluation-dataset | Rubric-based evaluation dataset simulating RLHF-style AI annotation,... | 14 | Experimental |
| 36 | pzzkkj324244/Bench2Drive-Leaderboard | 🚗 Track and compare performance of all methods tested on Bench2Drive,... | 14 | Experimental |
| 37 | syifatoo2751/CC-RLM | Reduce token use by delivering targeted code context to local LLMs with a... | 14 | Experimental |
| 38 | danghoawe/gg-keeper | 🔍 Monitor your Giffgaff SIM card data usage easily with this lightweight... | 14 | Experimental |
| 39 | Docktorjjd/llm-evaluation-framework | Automated evaluation and testing framework for LLM applications | 13 | Experimental |
| 40 | redoh/llm-code-analyzer | 🔬 LLM-based static code analysis engine with semantic understanding | 13 | Experimental |
| 41 | RapidataAI/crowd-eval | Break out of the AI training bubble | 13 | Experimental |
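The tier labels track the score column. A sketch of the apparent mapping, with thresholds inferred from the listing itself (the lowest Established entry scores 51 and the highest Emerging 49; the lowest Emerging scores 30 and the highest Experimental 29), not from any official scoring rubric:

```python
def tier(score: int) -> str:
    """Map a 0-100 quality score to its tier label.

    Thresholds are inferred from the ranking table and may not
    match the directory's actual scoring rules exactly.
    """
    if score >= 70:
        return "Verified"
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"
```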
