# LLM Evaluation & Benchmarking: ML Frameworks
Frameworks, platforms, and benchmarks for systematically evaluating and comparing LLM performance across metrics such as accuracy, safety, reliability, and cost. Does NOT include general LLM applications, deployment tools, or inference optimization.

41 LLM evaluation and benchmarking frameworks are tracked, of which 1 reaches the Verified tier (score of 70 or above). The highest-rated is Cloud-CV/EvalAI at 70/100 with 2,013 stars.
Get the tracked projects as JSON (raise `limit` to fetch all 41):

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=llm-evaluation-benchmarking&limit=20"
```

The API is open to everyone at 100 requests/day with no key needed; a free key raises the limit to 1,000 requests/day.
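The same query can be issued from Python. A minimal sketch, assuming only the endpoint and parameters shown in the curl command above (the response schema is not documented here, so the example stops at assembling the request URL; `build_quality_url` is a hypothetical helper name):

```python
from urllib.parse import urlencode

# Base endpoint taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_quality_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the dataset-quality query URL for a given domain/subcategory."""
    params = {"domain": domain, "subcategory": subcategory, "limit": limit}
    return f"{BASE}?{urlencode(params)}"

# Fetch all 41 projects in this subcategory by raising `limit`.
url = build_quality_url("ml-frameworks", "llm-evaluation-benchmarking", limit=41)
```

From here, something like `requests.get(url).json()` (or `urllib.request.urlopen`) would retrieve the list, subject to the 100 requests/day anonymous quota.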
| # | Framework | Description | Tier |
|---|---|---|---|
| 1 | Cloud-CV/EvalAI | :cloud: :rocket: :bar_chart: :chart_with_upwards_trend: Evaluating state of... | Verified |
| 2 | fireindark707/Python-Schema-Matching | A python tool using XGboost and sentence-transformers to perform schema... | Established |
| 3 | graphbookai/graphbook | Visual AI development framework for training and inference of ML models,... | Established |
| 4 | visual-layer/fastdup | fastdup is a powerful, free tool designed to rapidly generate valuable... | Established |
| 5 | github/CodeSearchNet | Datasets, tools, and benchmarks for representation learning of code. | Emerging |
| 6 | tthtlc/awesome-source-analysis | Source code understanding via Machine Learning techniques | Emerging |
| 7 | Xenios91/Glyph | An architecture independent binary analysis tool for fingerprinting... | Emerging |
| 8 | RAILethicsHub/rail-score | Python SDK | Emerging |
| 9 | Striveworks/valor | Valor is a lightweight, numpy-based library designed for fast and seamless... | Emerging |
| 10 | paceval/paceval | paceval is a high-performance mathematical runtime for deterministic AI and... | Emerging |
| 11 | RoboticsData/score_lerobot_episodes | A lightweight toolkit for quantitatively scoring LeRobot episodes. | Emerging |
| 12 | greynewell/evaldriven.org | Ship evals before you ship features. | Emerging |
| 13 | josh-ashkinaze/plurals | Plurals: A System for Guiding LLMs Via Simulated Social Ensembles | Emerging |
| 14 | AIT-Protocol/einstein-ait-prod | Supercharge Bittensor Ecosystem with Advanced Mathematical and Logical AI | Emerging |
| 15 | khoj-ai/llm-coup | Let LLMs play coup with each other and see who's the best at deception & strategy | Emerging |
| 16 | GustyCube/ERR-EVAL | Benchmark for evaluating AI epistemic reliability - testing how well LLMs... | Emerging |
| 17 | Fir121/llm-classifier | Structured LLM based classification, clustering and extraction framework... | Emerging |
| 18 | lac-dcc/yali | A framework to analyze a space formed by the combination of program... | Emerging |
| 19 | metriccoders/ml-models | This is the Metric Coders Model Hub that contains the fastest growing tiny... | Emerging |
| 20 | Vatshayan/Data-Duplication-Removal-using-Machine-Learning | Final Year Project as Deletion of Duplicated data using Machine learning... | Experimental |
| 21 | wkdhkr/dedupper | import various files, detect duplicates with sqlite, reject image file by... | Experimental |
| 22 | ppashakhanloo/CodeTrek | A powerful relational representation of source code | Experimental |
| 23 | MPX0222/BroadLearningSystem-APIs-1.0 | Modification for Broad Learning System, including BLS, CNN-BLS, PCA-BLS. Now... | Experimental |
| 24 | lof310/arch_eval | arch_eval is a high-level library for efficient architecture evaluation of... | Experimental |
| 25 | semantic-parsing/semantic-parsing.github.io | Website for "A Survey of Modeling and Data resources for Semantic Parsing" | Experimental |
| 26 | ztsalexey/epoch-bench | EPOCH: Evaluating Progress Origins in Causal History — LLM benchmark for... | Experimental |
| 27 | WINSTON672/lin-score | The Lin (𝓛) — a fundamental unit of AI cognitive efficiency. Like miles per... | Experimental |
| 28 | MukundaKatta/CacheLLM | Semantic caching for LLM responses — n-gram similarity matching, SQLite... | Experimental |
| 29 | theMethodolojeeOrg/SkynetBench | A rigorous methodology for detecting authority pressure's effect on AI... | Experimental |
| 30 | MukundaKatta/LLMProxy | Unified API proxy for LLM providers — OpenAI, Anthropic with fallback... | Experimental |
| 31 | cr7yash/EvalForge | LLM evaluation platform with 13+ metrics across accuracy, performance, and... | Experimental |
| 32 | jubaedemon/LBBS-Standard | 💰 Establish a standard for LLM billing and benchmarking to enable fair... | Experimental |
| 33 | Fardeen37/Data-Duplication-Remover-ML | A powerful machine learning based tool for detecting, analyzing, and... | Experimental |
| 34 | voxel51/reconstruction-error-ratios | Estimate dataset difficulty and detect label mistakes using reconstruction... | Experimental |
| 35 | jerarddxb-ops/excuse-evaluation-dataset | Rubric-based evaluation dataset simulating RLHF-style AI annotation,... | Experimental |
| 36 | pzzkkj324244/Bench2Drive-Leaderboard | 🚗 Track and compare performance of all methods tested on Bench2Drive,... | Experimental |
| 37 | syifatoo2751/CC-RLM | Reduce token use by delivering targeted code context to local LLMs with a... | Experimental |
| 38 | danghoawe/gg-keeper | 🔍 Monitor your Giffgaff SIM card data usage easily with this lightweight... | Experimental |
| 39 | Docktorjjd/llm-evaluation-framework | Automated evaluation and testing framework for LLM applications | Experimental |
| 40 | redoh/llm-code-analyzer | 🔬 LLM-based static code analysis engine with semantic understanding | Experimental |
| 41 | RapidataAI/crowd-eval | Break out of the AI training bubble | Experimental |