Evaluation Frameworks & Metrics (LLM Tools)

Tools for building, running, and standardizing LLM evaluation systems with multiple metrics, benchmarking pipelines, and automated scoring. Does NOT include domain-specific benchmarks (math, code, reasoning) or safety/robustness-focused evaluations.

There are 103 evaluation framework and metrics tools tracked. Two score 70 or above (Verified tier). The highest-rated is EvolvingLMMs-Lab/lmms-eval at 78/100 with 3,883 stars, and three of the top 10 are actively maintained.

Get all 103 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=evaluation-frameworks-metrics&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
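
The same request can be scripted. Below is a minimal Python sketch (standard library only) that calls the endpoint from the curl example above; the URL and query parameters are copied verbatim, while the exact shape of the returned JSON is an assumption to verify against the real payload.

import json
import urllib.request

# Endpoint and parameters copied from the curl example above;
# limit=20 returns only the first 20 results.
URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=llm-tools&subcategory=evaluation-frameworks-metrics&limit=20"
)

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# Dump the raw payload; adjust field access once the actual
# response structure has been inspected.
print(json.dumps(data, indent=2))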

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | EvolvingLMMs-Lab/lmms-eval | One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks | 78 | Verified |
| 2 | vibrantlabsai/ragas | Supercharge Your LLM Application Evaluations 🚀 | 70 | Verified |
| 3 | open-compass/VLMEvalKit | Open-source evaluation toolkit of large multi-modality models (LMMs),... | 69 | Established |
| 4 | EuroEval/EuroEval | The robust European language model benchmark. | 67 | Established |
| 5 | Giskard-AI/giskard-oss | 🐢 Open-Source Evaluation & Testing library for LLM Agents | 65 | Established |
| 6 | evalplus/evalplus | Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024 | 60 | Established |
| 7 | aiverify-foundation/moonshot | Moonshot - A simple and modular tool to evaluate and red-team any LLM application. | 58 | Established |
| 8 | DebarghaG/proofofthought | Proof of thought: LLM-based reasoning using Z3 theorem proving with... | 57 | Established |
| 9 | dustalov/evalica | Evalica, your favourite evaluation toolkit | 55 | Established |
| 10 | parameterlab/MASEval | Multi-Agent LLM Evaluation | 55 | Established |
| 11 | mohsenhariri/scorio | Statistical evaluation, comparison, and ranking of Large Language Models | 53 | Established |
| 12 | fiddler-labs/fiddler-auditor | Fiddler Auditor is a tool to evaluate language models. | 51 | Established |
| 13 | evo-eval/evoeval | EvoEval: Evolving Coding Benchmarks via LLM | 50 | Established |
| 14 | huggingface/evaluation-guidebook | Sharing both practical insights and theoretical knowledge about LLM... | 49 | Emerging |
| 15 | IAAR-Shanghai/xFinder | [ICLR 2025] xFinder: Large Language Models as Automated Evaluators for... | 49 | Emerging |
| 16 | lean-dojo/ReProver | Retrieval-Augmented Theorem Provers for Lean | 49 | Emerging |
| 17 | sciknoworg/YESciEval | YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering... | 47 | Emerging |
| 18 | zli12321/qa_metrics | An easy python package to run quick basic QA evaluations. This package... | 46 | Emerging |
| 19 | InternScience/SciEvalKit | A unified evaluation toolkit and leaderboard for rigorously assessing the... | 46 | Emerging |
| 20 | kieranklaassen/leva | LLM Evaluation Framework for Rails apps to be used with production data. | 45 | Emerging |
| 21 | bowen-upenn/PersonaMem | [COLM 2025] Know Me, Respond to Me: Benchmarking LLMs for Dynamic User... | 45 | Emerging |
| 22 | prometheus-eval/prometheus-eval | Evaluate your LLM's response with Prometheus and GPT4 💯 | 44 | Emerging |
| 23 | ai-twinkle/Eval | Twinkle Eval: an efficient and accurate AI evaluation tool | 44 | Emerging |
| 24 | IS2Lab/S-Eval | S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large... | 44 | Emerging |
| 25 | flexpa/llm-fhir-eval | Benchmarking Large Language Models for FHIR | 43 | Emerging |
| 26 | SeekingDream/Static-to-Dynamic-LLMEval | The official GitHub repository of the paper "Recent advances in large... | 43 | Emerging |
| 27 | FastEval/FastEval | Fast & more realistic evaluation of chat language models. Includes leaderboard. | 42 | Emerging |
| 28 | langwatch/langevals | LangEvals aggregates various language model evaluators into a single... | 41 | Emerging |
| 29 | VikhrModels/ru_llm_arena | Modified Arena-Hard-Auto LLM evaluation toolkit with an emphasis on Russian language | 41 | Emerging |
| 30 | namin/llm-verified-with-monte-carlo-tree-search | LLM verified with Monte Carlo Tree Search | 41 | Emerging |
| 31 | alopatenko/LLMEvaluation | A comprehensive guide to LLM evaluation methods designed to assist in... | 40 | Emerging |
| 32 | root-signals/scorable-sdk | Scorable SDK | 40 | Emerging |
| 33 | IAAR-Shanghai/UHGEval | [ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks:... | 40 | Emerging |
| 34 | mims-harvard/Qworld | Qworld: Question-Specific Evaluation Criteria for LLMs | 39 | Emerging |
| 35 | OpenDCAI/One-Eval | Automated system for LLM evaluation via agents. | 39 | Emerging |
| 36 | wgryc/phasellm | Large language model evaluation and workflow framework from Phase AI. | 39 | Emerging |
| 37 | superagent-ai/poker-eval | A comprehensive tool for assessing AI Agents performance in simulated poker... | 38 | Emerging |
| 38 | lmarena/search-arena | ⚔️ [ICLR 2026] Official code of "Search Arena: Analyzing Search-Augmented LLMs". | 38 | Emerging |
| 39 | multinear/multinear | Develop reliable AI apps | 38 | Emerging |
| 40 | HiThink-Research/GAGE | General AI evaluation and Gauge Engine. A unified evaluation engine for... | 38 | Emerging |
| 41 | terryyz/ice-score | [EACL 2024] ICE-Score: Instructing Large Language Models to Evaluate Code | 38 | Emerging |
| 42 | MLGroupJLU/LLM-eval-survey | The official GitHub page for the survey paper "A Survey on Evaluation of... | 37 | Emerging |
| 43 | mlchrzan/pairadigm | Concept-Guided Chain-of-Thought (CGCoT) pairwise annotation tool for... | 37 | Emerging |
| 44 | ZeroSumEval/ZeroSumEval | A framework for pitting LLMs against each other in an evolving library of games ⚔ | 36 | Emerging |
| 45 | gordicaleksa/serbian-llm-eval | Serbian LLM Eval. | 36 | Emerging |
| 46 | sileod/llm-theory-of-mind | Testing Theory of Mind (ToM) in language models with epistemic logic | 36 | Emerging |
| 47 | CS-EVAL/CS-Eval | CS-Eval is a comprehensive evaluation suite for fundamental cybersecurity... | 35 | Emerging |
| 48 | Re-Align/just-eval | A simple GPT-based evaluation tool for multi-aspect, interpretable... | 35 | Emerging |
| 49 | IAAR-Shanghai/GuessArena | [ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for... | 35 | Emerging |
| 50 | Cohere-Labs/multilingual-llm-evaluation-checklist | mLLM evaluation checklist | 34 | Emerging |
| 51 | Contextualist/lone-arena | Self-hosted LLM chatbot arena, with yourself as the only judge | 34 | Emerging |
| 52 | sinanuozdemir/oreilly-evaluating-llms | Metrics, Benchmarks, and Practical Tools for Assessing Large Language Models | 34 | Emerging |
| 53 | RGGH/evaluate | Evaluate - The Robust LLM Testing Framework 🦀 | 34 | Emerging |
| 54 | zorse-project/COBOLEval | Evaluate LLM-generated COBOL | 34 | Emerging |
| 55 | epam/ai-dial-rag-eval | A python library designed for RAG (Retrieval-Augmented Generation)... | 33 | Emerging |
| 56 | AMDResearch/NPUEval | NPUEval is an LLM evaluation dataset written specifically to target AIE... | 33 | Emerging |
| 57 | Azure-Samples/llm-eval-grader-samples | Framework for Post-production Evaluation of LLM based ChatBots | 33 | Emerging |
| 58 | ElevenLiy/MATEval | MATEval is the first multi-agent framework simulating human collaborative... | 32 | Emerging |
| 59 | allenai/CommonGen-Eval | Evaluating LLMs with CommonGen-Lite | 31 | Emerging |
| 60 | mags0ft/hle-eval-ollama | An easy-to-use evaluation tool for running Humanity's Last Exam on (locally)... | 31 | Emerging |
| 61 | kaistAI/FLASK | [ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on... | 31 | Emerging |
| 62 | vienneraphael/layton-eval | layton-eval is an AI eval benchmark for divergent, out-of-the-box and... | 30 | Emerging |
| 63 | evalkit/evalkit | The TypeScript LLM Evaluation Library | 30 | Emerging |
| 64 | GAI-Community/GraphOmni | Enable Comprehensive LLM Evaluation on Graph Reasoning | 30 | Emerging |
| 65 | hitz-zentroa/latxa | Latxa: An Open Language Model and Evaluation Suite for Basque | 29 | Experimental |
| 66 | deshwalmahesh/PHUDGE | Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your... | 29 | Experimental |
| 67 | IngestAI/deepmark | Deepmark AI enables a unique testing environment for language models (LLM)... | 29 | Experimental |
| 68 | linhaowei1/kumo | ☁️ KUMO: Generative Evaluation of Complex Reasoning in Large Language Models | 29 | Experimental |
| 69 | Humanity-s-Last-Code-Exam/HLCE | (EMNLP 2025 Findings) Source Evaluation scripts for Humanity's Last Code Exam | 28 | Experimental |
| 70 | jacobkandel/llm-content-moderation-analysis | Open-Source benchmark tracking LLM censorship and content moderation bias... | 27 | Experimental |
| 71 | mankinds/mankinds-eval | Open-source Python library for evaluating AI systems | 27 | Experimental |
| 72 | prorok9898/ERR-EVAL | 🔍 Evaluate AI models' ability to detect ambiguity and manage uncertainty... | 27 | Experimental |
| 73 | Aysnc-Labs/llm-eval | A PHP package for evaluating LLM outputs. Test your prompts, validate... | 27 | Experimental |
| 74 | spenceryonce/LLMeval | Evaluate and compare large language models (LLMs) for chatbot applications,... | 26 | Experimental |
| 75 | AtomEcho/AtomBulb | Aims to provide an intuitive, concrete, and standardized evaluation of current mainstream LLMs | 25 | Experimental |
| 76 | Goodeye-Labs/truesight-docs | Official documentation for Truesight — an AI evaluation platform for scoring... | 25 | Experimental |
| 77 | claw-eval/claw-eval | Claw-Eval is an evaluation harness for evaluating LLM as agents. All tasks... | 24 | Experimental |
| 78 | hparreao/Awesome-AI-Evaluation-Guide | A comprehensive, implementation-focused guide to evaluating Large Language... | 24 | Experimental |
| 79 | hnshah/verdict | LLM eval framework. Compare any model via OpenAI-compatible API. | 22 | Experimental |
| 80 | psandhaas/evaLLM | QA framework for evaluating LLM outputs based on user-defined metrics | 22 | Experimental |
| 81 | Linlichinese/rail-score | 🚀 Enable accurate assessment of AI models with the RAIL Score Python SDK,... | 22 | Experimental |
| 82 | LeonEricsson/llmjudge | Exploring limitations of LLM-as-a-judge | 22 | Experimental |
| 83 | CSLiJT/awesome-lm-evaluation-methodologies | Frontier papers in the evaluation methodologies of language models. | 21 | Experimental |
| 84 | jaaack-wang/multi-problem-eval-llm | Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing... | 21 | Experimental |
| 85 | framersai/promptmachine-eval | LLM evaluation framework with ELO ratings, arena battles, and benchmark testing | 20 | Experimental |
| 86 | JinjieNi/MixEval-X | The official github repo for MixEval-X, the first any-to-any, real-world benchmark. | 19 | Experimental |
| 87 | Yifan-Song793/GoodBadGreedy | The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore... | 19 | Experimental |
| 88 | vakyansh/truthfulqa_indic | Truthfulqa_indic, available in Hindi, Punjabi, Kannada, Tamil and Telugu | 19 | Experimental |
| 89 | arjunpatel7/alakazam-vgc | An LLM powered speed check assistant for Pokemon VGC Players | 18 | Experimental |
| 90 | crux82/wikigame-llm-eval | Companion repo for CLiC-it 2025 paper on WikiGame. Reproducible pipeline to... | 18 | Experimental |
| 91 | lechmazur/sycophancy | LLM benchmark and leaderboard for narrator-bias sycophancy,... | 18 | Experimental |
| 92 | Vibhanshu-555/Human-Aligned-LLM-Evaluation-Audit | A data-driven audit of AI judge reliability using MT-Bench human... | 17 | Experimental |
| 93 | thisguymartin/burro | Burro is a command-line interface (CLI) tool built with Deno for evaluating... | 16 | Experimental |
| 94 | BhuvanDontha/YouTube-policy-enforcement-auditor | Independent YouTube evaluation framework for content policy classification... | 15 | Experimental |
| 95 | Duguce/GuessArena-Demo | A web-based interactive demo for the GuessArena evaluation framework | 15 | Experimental |
| 96 | IyatomiLab/extended-jcm | MTLE implementation and the eJCM dataset, an extension of JCommonsenseMorality. | 15 | Experimental |
| 97 | MukundaKatta/redpill | The Red Pill Test — Can LLMs recognize the boundaries of their own reality?... | 14 | Experimental |
| 98 | djador13/moderatefocus | 🔍 Analyze community moderation and platform policies with the ModerateFocus... | 14 | Experimental |
| 99 | Theepankumargandhi/llm-annotation-quality-pipeline | Production-grade pipeline for validating annotation consistency and... | 14 | Experimental |
| 100 | DavidShableski/llm-evaluation-framework | A production-grade platform to evaluate and compare the performance of Large... | 13 | Experimental |
| 101 | The-Learning-Algorithm/ai-judge-pipeline | A comprehensive pipeline for generating, analyzing, and evaluating models... | 13 | Experimental |
| 102 | horde-research/Kaz-Offline-Arena | Offline LLM evaluation pipeline for Kazakh: run local HF models, auto-judge,... | 11 | Experimental |
| 103 | Paulyang80/LLMEvaluation-A100-vs-V100- | Found out that using A100 and V100 on Vicuna and Llama2 have a different... | 11 | Experimental |
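
To reproduce the tier breakdown shown in the listing (Verified, Established, Emerging, Experimental) from the API output, a hedged follow-up sketch could tally the fetched records by tier. The field names "tier" and "results", the payload shape, and the use of limit=103 to pull the full set are assumptions about the API, not documented behavior.

from collections import Counter
import json
import urllib.request

# ASSUMPTIONS: the limit parameter accepts the full project count, the payload
# is either a JSON array or an object with a "results" list, and each record
# carries a "tier" field. Verify against the actual response before relying on this.
URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=llm-tools&subcategory=evaluation-frameworks-metrics&limit=103"
)

with urllib.request.urlopen(URL) as resp:
    payload = json.load(resp)

projects = payload if isinstance(payload, list) else payload.get("results", [])
tier_counts = Counter(p.get("tier", "unknown") for p in projects)
for tier, count in tier_counts.most_common():
    print(f"{tier}: {count}")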