Evaluation Frameworks & Metrics (LLM Tools)
Tools for building, running, and standardizing LLM evaluation systems with multiple metrics, benchmarking pipelines, and automated scoring. Does NOT include domain-specific benchmarks (math, code, reasoning) or safety/robustness-focused evaluations.
There are 103 evaluation framework and metrics tools tracked. 2 score above 70 (Verified tier). The highest-rated is EvolvingLMMs-Lab/lmms-eval at 78/100 with 3,883 stars. 3 of the top 10 are actively maintained.
Get all 103 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=evaluation-frameworks-metrics&limit=20"
```

Open to everyone: 100 requests/day with no API key required. A free key raises the limit to 1,000 requests/day.
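If you want to slice the results locally, the same endpoint can be piped through jq. The sketch below is illustrative only: it assumes the response wraps entries in a `projects` array with `name`, `score`, and `tier` fields, and that the `limit` parameter can be raised to 103 to fetch the full list; inspect the raw JSON first, since the actual schema may differ.

```bash
# Minimal sketch, not a documented client. Field names ("projects", "name",
# "score", "tier") and limit=103 are assumptions about the response schema --
# adjust after inspecting the raw JSON.
curl -s "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=evaluation-frameworks-metrics&limit=103" \
  | jq -r '.projects[] | select(.tier == "Verified") | [.name, .score, .tier] | @tsv'
```

Swapping the `select` filter (for example, `.tier == "Established"`) pulls any other tier from the same response.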
| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | EvolvingLMMs-Lab/lmms-eval | One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks | 78 | Verified |
| 2 | vibrantlabsai/ragas | Supercharge Your LLM Application Evaluations 🚀 | | Verified |
| 3 | open-compass/VLMEvalKit | Open-source evaluation toolkit of large multi-modality models (LMMs),... | | Established |
| 4 | EuroEval/EuroEval | The robust European language model benchmark. | | Established |
| 5 | Giskard-AI/giskard-oss | 🐢 Open-Source Evaluation & Testing library for LLM Agents | | Established |
| 6 | evalplus/evalplus | Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024 | | Established |
| 7 | aiverify-foundation/moonshot | Moonshot - A simple and modular tool to evaluate and red-team any LLM application. | | Established |
| 8 | DebarghaG/proofofthought | Proof of thought: LLM-based reasoning using Z3 theorem proving with... | | Established |
| 9 | dustalov/evalica | Evalica, your favourite evaluation toolkit | | Established |
| 10 | parameterlab/MASEval | Multi-Agent LLM Evaluation | | Established |
| 11 | mohsenhariri/scorio | Statistical evaluation, comparison, and ranking of Large Language Models | | Established |
| 12 | fiddler-labs/fiddler-auditor | Fiddler Auditor is a tool to evaluate language models. | | Established |
| 13 | evo-eval/evoeval | EvoEval: Evolving Coding Benchmarks via LLM | | Established |
| 14 | huggingface/evaluation-guidebook | Sharing both practical insights and theoretical knowledge about LLM... | | Emerging |
| 15 | IAAR-Shanghai/xFinder | [ICLR 2025] xFinder: Large Language Models as Automated Evaluators for... | | Emerging |
| 16 | lean-dojo/ReProver | Retrieval-Augmented Theorem Provers for Lean | | Emerging |
| 17 | sciknoworg/YESciEval | YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering... | | Emerging |
| 18 | zli12321/qa_metrics | An easy python package to run quick basic QA evaluations. This package... | | Emerging |
| 19 | InternScience/SciEvalKit | A unified evaluation toolkit and leaderboard for rigorously assessing the... | | Emerging |
| 20 | kieranklaassen/leva | LLM Evaluation Framework for Rails apps to be used with production data. | | Emerging |
| 21 | bowen-upenn/PersonaMem | [COLM 2025] Know Me, Respond to Me: Benchmarking LLMs for Dynamic User... | | Emerging |
| 22 | prometheus-eval/prometheus-eval | Evaluate your LLM's response with Prometheus and GPT4 💯 | | Emerging |
| 23 | ai-twinkle/Eval | Twinkle Eval: an efficient and accurate AI evaluation tool | | Emerging |
| 24 | IS2Lab/S-Eval | S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large... | | Emerging |
| 25 | flexpa/llm-fhir-eval | Benchmarking Large Language Models for FHIR | | Emerging |
| 26 | SeekingDream/Static-to-Dynamic-LLMEval | The official GitHub repository of the paper "Recent advances in large... | | Emerging |
| 27 | FastEval/FastEval | Fast & more realistic evaluation of chat language models. Includes leaderboard. | | Emerging |
| 28 | langwatch/langevals | LangEvals aggregates various language model evaluators into a single... | | Emerging |
| 29 | VikhrModels/ru_llm_arena | Modified Arena-Hard-Auto LLM evaluation toolkit with an emphasis on Russian language | | Emerging |
| 30 | namin/llm-verified-with-monte-carlo-tree-search | LLM verified with Monte Carlo Tree Search | | Emerging |
| 31 | alopatenko/LLMEvaluation | A comprehensive guide to LLM evaluation methods designed to assist in... | | Emerging |
| 32 | root-signals/scorable-sdk | Scorable SDK | | Emerging |
| 33 | IAAR-Shanghai/UHGEval | [ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks:... | | Emerging |
| 34 | mims-harvard/Qworld | Qworld: Question-Specific Evaluation Criteria for LLMs | | Emerging |
| 35 | OpenDCAI/One-Eval | Automated system for LLM evaluation via agents. | | Emerging |
| 36 | wgryc/phasellm | Large language model evaluation and workflow framework from Phase AI. | | Emerging |
| 37 | superagent-ai/poker-eval | A comprehensive tool for assessing AI Agents performance in simulated poker... | | Emerging |
| 38 | lmarena/search-arena | ⚔️ [ICLR 2026] Official code of "Search Arena: Analyzing Search-Augmented LLMs". | | Emerging |
| 39 | multinear/multinear | Develop reliable AI apps | | Emerging |
| 40 | HiThink-Research/GAGE | General AI evaluation and Gauge Engine. A unified evaluation engine for... | | Emerging |
| 41 | terryyz/ice-score | [EACL 2024] ICE-Score: Instructing Large Language Models to Evaluate Code | | Emerging |
| 42 | MLGroupJLU/LLM-eval-survey | The official GitHub page for the survey paper "A Survey on Evaluation of... | | Emerging |
| 43 | mlchrzan/pairadigm | Concept-Guided Chain-of-Thought (CGCoT) pairwise annotation tool for... | | Emerging |
| 44 | ZeroSumEval/ZeroSumEval | A framework for pitting LLMs against each other in an evolving library of games ⚔ | | Emerging |
| 45 | gordicaleksa/serbian-llm-eval | Serbian LLM Eval. | | Emerging |
| 46 | sileod/llm-theory-of-mind | Testing Theory of Mind (ToM) in language models with epistemic logic | | Emerging |
| 47 | CS-EVAL/CS-Eval | CS-Eval is a comprehensive evaluation suite for fundamental cybersecurity... | | Emerging |
| 48 | Re-Align/just-eval | A simple GPT-based evaluation tool for multi-aspect, interpretable... | | Emerging |
| 49 | IAAR-Shanghai/GuessArena | [ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for... | | Emerging |
| 50 | Cohere-Labs/multilingual-llm-evaluation-checklist | mLLM evaluation checklist | | Emerging |
| 51 | Contextualist/lone-arena | Self-hosted LLM chatbot arena, with yourself as the only judge | | Emerging |
| 52 | sinanuozdemir/oreilly-evaluating-llms | Metrics, Benchmarks, and Practical Tools for Assessing Large Language Models | | Emerging |
| 53 | RGGH/evaluate | Evaluate - The Robust LLM Testing Framework 🦀 | | Emerging |
| 54 | zorse-project/COBOLEval | Evaluate LLM-generated COBOL | | Emerging |
| 55 | epam/ai-dial-rag-eval | A python library designed for RAG (Retrieval-Augmented Generation)... | | Emerging |
| 56 | AMDResearch/NPUEval | NPUEval is an LLM evaluation dataset written specifically to target AIE... | | Emerging |
| 57 | Azure-Samples/llm-eval-grader-samples | Framework for Post-production Evaluation of LLM based ChatBots | | Emerging |
| 58 | ElevenLiy/MATEval | MATEval is the first multi-agent framework simulating human collaborative... | | Emerging |
| 59 | allenai/CommonGen-Eval | Evaluating LLMs with CommonGen-Lite | | Emerging |
| 60 | mags0ft/hle-eval-ollama | An easy-to-use evaluation tool for running Humanity's Last Exam on (locally)... | | Emerging |
| 61 | kaistAI/FLASK | [ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on... | | Emerging |
| 62 | vienneraphael/layton-eval | layton-eval is an AI eval benchmark for divergent, out-of-the-box and... | | Emerging |
| 63 | evalkit/evalkit | The TypeScript LLM Evaluation Library | | Emerging |
| 64 | GAI-Community/GraphOmni | Enable Comprehensive LLM Evaluation on Graph Reasoning | | Emerging |
| 65 | hitz-zentroa/latxa | Latxa: An Open Language Model and Evaluation Suite for Basque | | Experimental |
| 66 | deshwalmahesh/PHUDGE | Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your... | | Experimental |
| 67 | IngestAI/deepmark | Deepmark AI enables a unique testing environment for language models (LLM)... | | Experimental |
| 68 | linhaowei1/kumo | ☁️ KUMO: Generative Evaluation of Complex Reasoning in Large Language Models | | Experimental |
| 69 | Humanity-s-Last-Code-Exam/HLCE | (EMNLP 2025 Findings) Source Evaluation scripts for Humanity's Last Code Exam | | Experimental |
| 70 | jacobkandel/llm-content-moderation-analysis | Open-Source benchmark tracking LLM censorship and content moderation bias... | | Experimental |
| 71 | mankinds/mankinds-eval | Open-source Python library for evaluating AI systems | | Experimental |
| 72 | prorok9898/ERR-EVAL | 🔍 Evaluate AI models' ability to detect ambiguity and manage uncertainty... | | Experimental |
| 73 | Aysnc-Labs/llm-eval | A PHP package for evaluating LLM outputs. Test your prompts, validate... | | Experimental |
| 74 | spenceryonce/LLMeval | Evaluate and compare large language models (LLMs) for chatbot applications,... | | Experimental |
| 75 | AtomEcho/AtomBulb | Aims to provide an intuitive, concrete, and standardized evaluation of current mainstream LLMs | | Experimental |
| 76 | Goodeye-Labs/truesight-docs | Official documentation for Truesight, an AI evaluation platform for scoring... | | Experimental |
| 77 | claw-eval/claw-eval | Claw-Eval is an evaluation harness for evaluating LLM as agents. All tasks... | | Experimental |
| 78 | hparreao/Awesome-AI-Evaluation-Guide | A comprehensive, implementation-focused guide to evaluating Large Language... | | Experimental |
| 79 | hnshah/verdict | LLM eval framework. Compare any model via OpenAI-compatible API. | | Experimental |
| 80 | psandhaas/evaLLM | QA framework for evaluating LLM outputs based on user-defined metrics | | Experimental |
| 81 | Linlichinese/rail-score | 🚀 Enable accurate assessment of AI models with the RAIL Score Python SDK,... | | Experimental |
| 82 | LeonEricsson/llmjudge | Exploring limitations of LLM-as-a-judge | | Experimental |
| 83 | CSLiJT/awesome-lm-evaluation-methodologies | Frontier papers in the evaluation methodologies of language models. | | Experimental |
| 84 | jaaack-wang/multi-problem-eval-llm | Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing... | | Experimental |
| 85 | framersai/promptmachine-eval | LLM evaluation framework with ELO ratings, arena battles, and benchmark testing | | Experimental |
| 86 | JinjieNi/MixEval-X | The official GitHub repo for MixEval-X, the first any-to-any, real-world benchmark. | | Experimental |
| 87 | Yifan-Song793/GoodBadGreedy | The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore... | | Experimental |
| 88 | vakyansh/truthfulqa_indic | Truthfulqa_indic, available in Hindi, Punjabi, Kannada, Tamil and Telugu | | Experimental |
| 89 | arjunpatel7/alakazam-vgc | An LLM powered speed check assistant for Pokemon VGC Players | | Experimental |
| 90 | crux82/wikigame-llm-eval | Companion repo for CLiC-it 2025 paper on WikiGame. Reproducible pipeline to... | | Experimental |
| 91 | lechmazur/sycophancy | LLM benchmark and leaderboard for narrator-bias sycophancy,... | | Experimental |
| 92 | Vibhanshu-555/Human-Aligned-LLM-Evaluation-Audit | A data-driven audit of AI judge reliability using MT-Bench human... | | Experimental |
| 93 | thisguymartin/burro | Burro is a command-line interface (CLI) tool built with Deno for evaluating... | | Experimental |
| 94 | BhuvanDontha/YouTube-policy-enforcement-auditor | Independent YouTube evaluation framework for content policy classification... | | Experimental |
| 95 | Duguce/GuessArena-Demo | A web-based interactive demo for the GuessArena evaluation framework | | Experimental |
| 96 | IyatomiLab/extended-jcm | MTLE implementation and the eJCM dataset, an extension of JCommonsenseMorality. | | Experimental |
| 97 | MukundaKatta/redpill | The Red Pill Test: Can LLMs recognize the boundaries of their own reality?... | | Experimental |
| 98 | djador13/moderatefocus | 🔍 Analyze community moderation and platform policies with the ModerateFocus... | | Experimental |
| 99 | Theepankumargandhi/llm-annotation-quality-pipeline | Production-grade pipeline for validating annotation consistency and... | | Experimental |
| 100 | DavidShableski/llm-evaluation-framework | A production-grade platform to evaluate and compare the performance of Large... | | Experimental |
| 101 | The-Learning-Algorithm/ai-judge-pipeline | A comprehensive pipeline for generating, analyzing, and evaluating models... | | Experimental |
| 102 | horde-research/Kaz-Offline-Arena | Offline LLM evaluation pipeline for Kazakh: run local HF models, auto-judge,... | | Experimental |
| 103 | Paulyang80/LLMEvaluation-A100-vs-V100- | Found out that using A100 and V100 on Vicuna and Llama2 have a different... | | Experimental |