LLM Evaluation & Benchmarking: ML Frameworks

Frameworks, platforms, and benchmarks for systematically evaluating and comparing LLM performance across metrics like accuracy, safety, reliability, and cost. Does NOT include general LLM applications, deployment tools, or inference optimization.

There are 41 LLM evaluation and benchmarking frameworks tracked; 1 scores 70 or above (Verified tier). The highest-rated is Cloud-CV/EvalAI at 70/100 with 2,013 stars.

Get all 41 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=llm-evaluation-benchmarking&limit=20"
```

The API is open to everyone: 100 requests/day with no key needed, or get a free key for 1,000 requests/day.
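The same query can be issued from Python. A minimal sketch using only the standard library; it builds the request URL shown above (the response schema is not documented here, so the fetch step is left as a comment):

```python
from urllib.parse import urlencode

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def quality_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Build the dataset-quality query URL for the public API."""
    params = {"domain": domain, "subcategory": subcategory, "limit": limit}
    return f"{BASE}?{urlencode(params)}"

url = quality_url("ml-frameworks", "llm-evaluation-benchmarking")
# To actually fetch (network access required):
#   import json, urllib.request
#   data = json.load(urllib.request.urlopen(url))
```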

| # | Framework | Description | Score | Tier |
|---|-----------|-------------|-------|------|
| 1 | Cloud-CV/EvalAI | :cloud: :rocket: :bar_chart: :chart_with_upwards_trend: Evaluating state of... | 70 | Verified |
| 2 | fireindark707/Python-Schema-Matching | A python tool using XGboost and sentence-transformers to perform schema... | 60 | Established |
| 3 | graphbookai/graphbook | Visual AI development framework for training and inference of ML models,... | 57 | Established |
| 4 | visual-layer/fastdup | fastdup is a powerful, free tool designed to rapidly generate valuable... | 51 | Established |
| 5 | github/CodeSearchNet | Datasets, tools, and benchmarks for representation learning of code. | 49 | Emerging |
| 6 | tthtlc/awesome-source-analysis | Source code understanding via Machine Learning techniques | 47 | Emerging |
| 7 | Xenios91/Glyph | An architecture independent binary analysis tool for fingerprinting... | 46 | Emerging |
| 8 | RAILethicsHub/rail-score | Python SDK | 44 | Emerging |
| 9 | Striveworks/valor | Valor is a lightweight, numpy-based library designed for fast and seamless... | 43 | Emerging |
| 10 | paceval/paceval | paceval is a high-performance mathematical runtime for deterministic AI and... | 43 | Emerging |
| 11 | RoboticsData/score_lerobot_episodes | A lightweight toolkit for quantitatively scoring LeRobot episodes. | 42 | Emerging |
| 12 | greynewell/evaldriven.org | Ship evals before you ship features. | 42 | Emerging |
| 13 | josh-ashkinaze/plurals | Plurals: A System for Guiding LLMs Via Simulated Social Ensembles | 40 | Emerging |
| 14 | AIT-Protocol/einstein-ait-prod | Supercharge Bittensor Ecosystem with Advanced Mathematical and Logical AI | 36 | Emerging |
| 15 | khoj-ai/llm-coup | Let LLMs play coup with each other and see who's the best at deception & strategy | 36 | Emerging |
| 16 | GustyCube/ERR-EVAL | Benchmark for evaluating AI epistemic reliability - testing how well LLMs... | 32 | Emerging |
| 17 | Fir121/llm-classifier | Structured LLM based classification, clustering and extraction framework... | 32 | Emerging |
| 18 | lac-dcc/yali | A framework to analyze a space formed by the combination of program... | 31 | Emerging |
| 19 | metriccoders/ml-models | This is the Metric Coders Model Hub that contains the fastest growing tiny... | 30 | Emerging |
| 20 | Vatshayan/Data-Duplication-Removal-using-Machine-Learning | Final Year Project as Deletion of Duplicated data using Machine learning... | 29 | Experimental |
| 21 | wkdhkr/dedupper | import various files, detect duplicates with sqlite, reject image file by... | 28 | Experimental |
| 22 | ppashakhanloo/CodeTrek | A powerful relational representation of source code | 28 | Experimental |
| 23 | MPX0222/BroadLearningSystem-APIs-1.0 | Modification for Broad Learning System, including BLS, CNN-BLS, PCA-BLS. Now... | 24 | Experimental |
| 24 | lof310/arch_eval | arch_eval is a high-level library for efficient architecture evaluation of... | 24 | Experimental |
| 25 | semantic-parsing/semantic-parsing.github.io | Website for "A Survey of Modeling and Data resources for Semantic Parsing" | 24 | Experimental |
| 26 | ztsalexey/epoch-bench | EPOCH: Evaluating Progress Origins in Causal History — LLM benchmark for... | 23 | Experimental |
| 27 | WINSTON672/lin-score | The Lin (𝓛) — a fundamental unit of AI cognitive efficiency. Like miles per... | 22 | Experimental |
| 28 | MukundaKatta/CacheLLM | Semantic caching for LLM responses — n-gram similarity matching, SQLite... | 22 | Experimental |
| 29 | theMethodolojeeOrg/SkynetBench | A rigorous methodology for detecting authority pressure's effect on AI... | 22 | Experimental |
| 30 | MukundaKatta/LLMProxy | Unified API proxy for LLM providers — OpenAI, Anthropic with fallback... | 22 | Experimental |
| 31 | cr7yash/EvalForge | LLM evaluation platform with 13+ metrics across accuracy, performance, and... | 21 | Experimental |
| 32 | jubaedemon/LBBS-Standard | 💰 Establish a standard for LLM billing and benchmarking to enable fair... | 21 | Experimental |
| 33 | Fardeen37/Data-Duplication-Remover-ML | A powerful machine learning based tool for detecting, analyzing, and... | 20 | Experimental |
| 34 | voxel51/reconstruction-error-ratios | Estimate dataset difficulty and detect label mistakes using reconstruction... | 15 | Experimental |
| 35 | jerarddxb-ops/excuse-evaluation-dataset | Rubric-based evaluation dataset simulating RLHF-style AI annotation,... | 14 | Experimental |
| 36 | pzzkkj324244/Bench2Drive-Leaderboard | 🚗 Track and compare performance of all methods tested on Bench2Drive,... | 14 | Experimental |
| 37 | syifatoo2751/CC-RLM | Reduce token use by delivering targeted code context to local LLMs with a... | 14 | Experimental |
| 38 | danghoawe/gg-keeper | 🔍 Monitor your Giffgaff SIM card data usage easily with this lightweight... | 14 | Experimental |
| 39 | Docktorjjd/llm-evaluation-framework | Automated evaluation and testing framework for LLM applications | 13 | Experimental |
| 40 | redoh/llm-code-analyzer | 🔬 LLM-based static code analysis engine with semantic understanding | 13 | Experimental |
| 41 | RapidataAI/crowd-eval | Break out of the AI training bubble | 13 | Experimental |
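The tier labels track the score column. A sketch of the apparent mapping, with thresholds inferred from the listing itself (the lowest Established entry scores 51 and the highest Emerging 49; the lowest Emerging scores 30 and the highest Experimental 29), not from any official scoring rubric:

```python
def tier(score: int) -> str:
    """Map a 0-100 quality score to its tier label.

    Thresholds are inferred from the ranking table and may not
    match the directory's actual scoring rules exactly.
    """
    if score >= 70:
        return "Verified"
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"
```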
