Evaluation Frameworks & Metrics (LLM Tools)
Tools for building, running, and standardizing LLM evaluation systems with multiple metrics, benchmarking pipelines, and automated scoring. Does NOT include domain-specific benchmarks (math, code, reasoning) or safety/robustness-focused evaluations.
There are 103 evaluation framework and metrics tools tracked. 2 score above 70 (Verified tier). The highest-rated is EvolvingLMMs-Lab/lmms-eval at 78/100 with 3,883 stars. 3 of the top 10 are actively maintained.
Get all 103 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=evaluation-frameworks-metrics&limit=20"
```

Open to everyone: 100 requests/day with no API key required. A free key raises the limit to 1,000 requests/day.
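If you want to slice the results locally, the same endpoint can be piped through jq. The sketch below is illustrative only: it assumes the response wraps entries in a `projects` array with `name`, `score`, and `tier` fields, and that the `limit` parameter can be raised to 103 to fetch the full list; inspect the raw JSON first, since the actual schema may differ.

```bash
# Minimal sketch, not a documented client. Field names ("projects", "name",
# "score", "tier") and limit=103 are assumptions about the response schema --
# adjust after inspecting the raw JSON.
curl -s "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=evaluation-frameworks-metrics&limit=103" \
  | jq -r '.projects[] | select(.tier == "Verified") | [.name, .score, .tier] | @tsv'
```

Swapping the `select` filter (for example, `.tier == "Established"`) pulls any other tier from the same response.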
| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | EvolvingLMMs-Lab/lmms-eval | One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks | 78 | Verified |
| 2 | vibrantlabsai/ragas | Supercharge Your LLM Application Evaluations 🚀 | | Verified |
| 3 | open-compass/VLMEvalKit | Open-source evaluation toolkit of large multi-modality models (LMMs),... | | Established |
| 4 | EuroEval/EuroEval | The robust European language model benchmark. | | Established |
| 5 | Giskard-AI/giskard-oss | 🐢 Open-Source Evaluation & Testing library for LLM Agents | | Established |
| 6 | evalplus/evalplus | Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024 | | Established |
| 7 | aiverify-foundation/moonshot | Moonshot - A simple and modular tool to evaluate and red-team any LLM application. | | Established |
| 8 | DebarghaG/proofofthought | Proof of thought: LLM-based reasoning using Z3 theorem proving with... | | Established |
| 9 | dustalov/evalica | Evalica, your favourite evaluation toolkit | | Established |
| 10 | parameterlab/MASEval | Multi-Agent LLM Evaluation | | Established |
| 11 | mohsenhariri/scorio | Statistical evaluation, comparison, and ranking of Large Language Models | | Established |
| 12 | fiddler-labs/fiddler-auditor | Fiddler Auditor is a tool to evaluate language models. | | Established |
| 13 | evo-eval/evoeval | EvoEval: Evolving Coding Benchmarks via LLM | | Established |
| 14 | huggingface/evaluation-guidebook | Sharing both practical insights and theoretical knowledge about LLM... | | Emerging |
| 15 | IAAR-Shanghai/xFinder | [ICLR 2025] xFinder: Large Language Models as Automated Evaluators for... | | Emerging |
| 16 | lean-dojo/ReProver | Retrieval-Augmented Theorem Provers for Lean | | Emerging |
| 17 | sciknoworg/YESciEval | YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering... | | Emerging |
| 18 | zli12321/qa_metrics | An easy python package to run quick basic QA evaluations. This package... | | Emerging |
| 19 | InternScience/SciEvalKit | A unified evaluation toolkit and leaderboard for rigorously assessing the... | | Emerging |
| 20 | kieranklaassen/leva | LLM Evaluation Framework for Rails apps to be used with production data. | | Emerging |
| 21 | bowen-upenn/PersonaMem | [COLM 2025] Know Me, Respond to Me: Benchmarking LLMs for Dynamic User... | | Emerging |
| 22 | prometheus-eval/prometheus-eval | Evaluate your LLM's response with Prometheus and GPT4 💯 | | Emerging |
| 23 | ai-twinkle/Eval | Twinkle Eval: an efficient and accurate AI evaluation tool | | Emerging |
| 24 | IS2Lab/S-Eval | S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large... | | Emerging |
| 25 | flexpa/llm-fhir-eval | Benchmarking Large Language Models for FHIR | | Emerging |
| 26 | SeekingDream/Static-to-Dynamic-LLMEval | The official GitHub repository of the paper "Recent advances in large... | | Emerging |
| 27 | FastEval/FastEval | Fast & more realistic evaluation of chat language models. Includes leaderboard. | | Emerging |
| 28 | langwatch/langevals | LangEvals aggregates various language model evaluators into a single... | | Emerging |
| 29 | VikhrModels/ru_llm_arena | Modified Arena-Hard-Auto LLM evaluation toolkit with an emphasis on Russian language | | Emerging |
| 30 | namin/llm-verified-with-monte-carlo-tree-search | LLM verified with Monte Carlo Tree Search | | Emerging |
| 31 | alopatenko/LLMEvaluation | A comprehensive guide to LLM evaluation methods designed to assist in... | | Emerging |
| 32 | root-signals/scorable-sdk | Scorable SDK | | Emerging |
| 33 | IAAR-Shanghai/UHGEval | [ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks:... | | Emerging |
| 34 | mims-harvard/Qworld | Qworld: Question-Specific Evaluation Criteria for LLMs | | Emerging |
| 35 | OpenDCAI/One-Eval | Automated system for LLM evaluation via agents. | | Emerging |
| 36 | wgryc/phasellm | Large language model evaluation and workflow framework from Phase AI. | | Emerging |
| 37 | superagent-ai/poker-eval | A comprehensive tool for assessing AI Agents performance in simulated poker... | | Emerging |
| 38 | lmarena/search-arena | ⚔️ [ICLR 2026] Official code of "Search Arena: Analyzing Search-Augmented LLMs". | | Emerging |
| 39 | multinear/multinear | Develop reliable AI apps | | Emerging |
| 40 | HiThink-Research/GAGE | General AI evaluation and Gauge Engine. A unified evaluation engine for... | | Emerging |
| 41 | terryyz/ice-score | [EACL 2024] ICE-Score: Instructing Large Language Models to Evaluate Code | | Emerging |
| 42 | MLGroupJLU/LLM-eval-survey | The official GitHub page for the survey paper "A Survey on Evaluation of... | | Emerging |
| 43 | mlchrzan/pairadigm | Concept-Guided Chain-of-Thought (CGCoT) pairwise annotation tool for... | | Emerging |
| 44 | ZeroSumEval/ZeroSumEval | A framework for pitting LLMs against each other in an evolving library of games ⚔ | | Emerging |
| 45 | gordicaleksa/serbian-llm-eval | Serbian LLM Eval. | | Emerging |
| 46 | sileod/llm-theory-of-mind | Testing Theory of Mind (ToM) in language models with epistemic logic | | Emerging |
| 47 | CS-EVAL/CS-Eval | CS-Eval is a comprehensive evaluation suite for fundamental cybersecurity... | | Emerging |
| 48 | Re-Align/just-eval | A simple GPT-based evaluation tool for multi-aspect, interpretable... | | Emerging |
| 49 | IAAR-Shanghai/GuessArena | [ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for... | | Emerging |
| 50 | Cohere-Labs/multilingual-llm-evaluation-checklist | mLLM evaluation checklist | | Emerging |
| 51 | Contextualist/lone-arena | Self-hosted LLM chatbot arena, with yourself as the only judge | | Emerging |
| 52 | sinanuozdemir/oreilly-evaluating-llms | Metrics, Benchmarks, and Practical Tools for Assessing Large Language Models | | Emerging |
| 53 | RGGH/evaluate | Evaluate - The Robust LLM Testing Framework 🦀 | | Emerging |
| 54 | zorse-project/COBOLEval | Evaluate LLM-generated COBOL | | Emerging |
| 55 | epam/ai-dial-rag-eval | A python library designed for RAG (Retrieval-Augmented Generation)... | | Emerging |
| 56 | AMDResearch/NPUEval | NPUEval is an LLM evaluation dataset written specifically to target AIE... | | Emerging |
| 57 | Azure-Samples/llm-eval-grader-samples | Framework for Post-production Evaluation of LLM based ChatBots | | Emerging |
| 58 | ElevenLiy/MATEval | MATEval is the first multi-agent framework simulating human collaborative... | | Emerging |
| 59 | allenai/CommonGen-Eval | Evaluating LLMs with CommonGen-Lite | | Emerging |
| 60 | mags0ft/hle-eval-ollama | An easy-to-use evaluation tool for running Humanity's Last Exam on (locally)... | | Emerging |
| 61 | kaistAI/FLASK | [ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on... | | Emerging |
| 62 | vienneraphael/layton-eval | layton-eval is an AI eval benchmark for divergent, out-of-the-box and... | | Emerging |
| 63 | evalkit/evalkit | The TypeScript LLM Evaluation Library | | Emerging |
| 64 | GAI-Community/GraphOmni | Enable Comprehensive LLM Evaluation on Graph Reasoning | | Emerging |
| 65 | hitz-zentroa/latxa | Latxa: An Open Language Model and Evaluation Suite for Basque | | Experimental |
| 66 | deshwalmahesh/PHUDGE | Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your... | | Experimental |
| 67 | IngestAI/deepmark | Deepmark AI enables a unique testing environment for language models (LLM)... | | Experimental |
| 68 | linhaowei1/kumo | ☁️ KUMO: Generative Evaluation of Complex Reasoning in Large Language Models | | Experimental |
| 69 | Humanity-s-Last-Code-Exam/HLCE | (EMNLP 2025 Findings) Source Evaluation scripts for Humanity's Last Code Exam | | Experimental |
| 70 | jacobkandel/llm-content-moderation-analysis | Open-Source benchmark tracking LLM censorship and content moderation bias... | | Experimental |
| 71 | mankinds/mankinds-eval | Open-source Python library for evaluating AI systems | | Experimental |
| 72 | prorok9898/ERR-EVAL | 🔍 Evaluate AI models' ability to detect ambiguity and manage uncertainty... | | Experimental |
| 73 | Aysnc-Labs/llm-eval | A PHP package for evaluating LLM outputs. Test your prompts, validate... | | Experimental |
| 74 | spenceryonce/LLMeval | Evaluate and compare large language models (LLMs) for chatbot applications,... | | Experimental |
| 75 | AtomEcho/AtomBulb | Aims to provide an intuitive, concrete, and standardized evaluation of current mainstream LLMs | | Experimental |
| 76 | Goodeye-Labs/truesight-docs | Official documentation for Truesight, an AI evaluation platform for scoring... | | Experimental |
| 77 | claw-eval/claw-eval | Claw-Eval is an evaluation harness for evaluating LLM as agents. All tasks... | | Experimental |
| 78 | hparreao/Awesome-AI-Evaluation-Guide | A comprehensive, implementation-focused guide to evaluating Large Language... | | Experimental |
| 79 | hnshah/verdict | LLM eval framework. Compare any model via OpenAI-compatible API. | | Experimental |
| 80 | psandhaas/evaLLM | QA framework for evaluating LLM outputs based on user-defined metrics | | Experimental |
| 81 | Linlichinese/rail-score | 🚀 Enable accurate assessment of AI models with the RAIL Score Python SDK,... | | Experimental |
| 82 | LeonEricsson/llmjudge | Exploring limitations of LLM-as-a-judge | | Experimental |
| 83 | CSLiJT/awesome-lm-evaluation-methodologies | Frontier papers in the evaluation methodologies of language models. | | Experimental |
| 84 | jaaack-wang/multi-problem-eval-llm | Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing... | | Experimental |
| 85 | framersai/promptmachine-eval | LLM evaluation framework with ELO ratings, arena battles, and benchmark testing | | Experimental |
| 86 | JinjieNi/MixEval-X | The official GitHub repo for MixEval-X, the first any-to-any, real-world benchmark. | | Experimental |
| 87 | Yifan-Song793/GoodBadGreedy | The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore... | | Experimental |
| 88 | vakyansh/truthfulqa_indic | Truthfulqa_indic, available in Hindi, Punjabi, Kannada, Tamil and Telugu | | Experimental |
| 89 | arjunpatel7/alakazam-vgc | An LLM powered speed check assistant for Pokemon VGC Players | | Experimental |
| 90 | crux82/wikigame-llm-eval | Companion repo for CLiC-it 2025 paper on WikiGame. Reproducible pipeline to... | | Experimental |
| 91 | lechmazur/sycophancy | LLM benchmark and leaderboard for narrator-bias sycophancy,... | | Experimental |
| 92 | Vibhanshu-555/Human-Aligned-LLM-Evaluation-Audit | A data-driven audit of AI judge reliability using MT-Bench human... | | Experimental |
| 93 | thisguymartin/burro | Burro is a command-line interface (CLI) tool built with Deno for evaluating... | | Experimental |
| 94 | BhuvanDontha/YouTube-policy-enforcement-auditor | Independent YouTube evaluation framework for content policy classification... | | Experimental |
| 95 | Duguce/GuessArena-Demo | A web-based interactive demo for the GuessArena evaluation framework | | Experimental |
| 96 | IyatomiLab/extended-jcm | MTLE implementation and the eJCM dataset, an extension of JCommonsenseMorality. | | Experimental |
| 97 | MukundaKatta/redpill | The Red Pill Test: Can LLMs recognize the boundaries of their own reality?... | | Experimental |
| 98 | djador13/moderatefocus | 🔍 Analyze community moderation and platform policies with the ModerateFocus... | | Experimental |
| 99 | Theepankumargandhi/llm-annotation-quality-pipeline | Production-grade pipeline for validating annotation consistency and... | | Experimental |
| 100 | DavidShableski/llm-evaluation-framework | A production-grade platform to evaluate and compare the performance of Large... | | Experimental |
| 101 | The-Learning-Algorithm/ai-judge-pipeline | A comprehensive pipeline for generating, analyzing, and evaluating models... | | Experimental |
| 102 | horde-research/Kaz-Offline-Arena | Offline LLM evaluation pipeline for Kazakh: run local HF models, auto-judge,... | | Experimental |
| 103 | Paulyang80/LLMEvaluation-A100-vs-V100- | Found out that using A100 and V100 on Vicuna and Llama2 have a different... | | Experimental |