LLM Comparison & Evaluation Tools
Tools for comparing LLM outputs, benchmarking performance across multiple models, and evaluating LLM quality on specific tasks. Does NOT include general LLM evaluation frameworks, prompt engineering resources, or single-model testing tools.
There are 82 LLM comparison and evaluation tools tracked. One scores above 70 (Verified tier): the highest-rated, open-compass/opencompass, at 73/100 with 6,752 stars. One of the top 10 is actively maintained.
Get all 82 projects as JSON (note `limit=82` to fetch the full set):

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-comparison-evaluation&limit=82"
```

Open to everyone: 100 requests/day with no key; a free key raises the limit to 1,000/day.
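For programmatic use, the endpoint can be consumed with a few lines of Python. A minimal sketch: the base URL and query parameters come from the curl command above, but the response schema is not documented here, so the `name`/`score`/`tier` fields and the inline sample payload are illustrative assumptions (only opencompass's 73/Verified is stated in this listing; the second entry's null score is a placeholder, not real data).

```python
import json
import urllib.parse

# Endpoint and query parameters as shown in the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {"domain": "llm-tools",
          "subcategory": "llm-comparison-evaluation",
          "limit": 82}
url = BASE + "?" + urllib.parse.urlencode(params)
# Fetch with e.g. urllib.request.urlopen(url) -- omitted here to keep
# the sketch offline-runnable.

# Hypothetical response shape; the real API schema may differ.
sample = json.loads("""
[
  {"name": "open-compass/opencompass", "score": 73, "tier": "Verified"},
  {"name": "IBM/unitxt", "score": null, "tier": "Established"}
]
""")

def names_in_tier(projects, tier):
    """Return the names of projects in the given quality tier."""
    return [p["name"] for p in projects if p["tier"] == tier]

print(names_in_tier(sample, "Verified"))
```

Filtering locally by tier keeps the script to a single API call per day, which matters under the 100 requests/day anonymous quota.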
| # | Tool | Description | Tier |
|---|------|-------------|------|
| 1 | open-compass/opencompass | OpenCompass is an LLM evaluation platform, supporting a wide range of models... | Verified |
| 2 | IBM/unitxt | 🦄 Unitxt is a Python library for enterprise-grade evaluation of AI... | Established |
| 3 | lean-dojo/LeanDojo | Tool for data extraction and interacting with Lean programmatically. | Established |
| 4 | GoodStartLabs/AI_Diplomacy | Frontier models playing the board game Diplomacy. | Established |
| 5 | google/litmus | Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI... | Established |
| 6 | salesforce/CodeT5 | Home of CodeT5: Open Code LLMs for Code Understanding and Generation. | Emerging |
| 7 | NatLabRockies/COMPASS | INFRA-COMPASS is a tool that leverages Large Language Models (LLMs) to... | Emerging |
| 8 | JonathanChavezTamales/llm-leaderboard | A comprehensive set of LLM benchmark scores and provider prices... | Emerging |
| 9 | namin/dafny-sketcher | Piggybacking on the Dafny language implementation to explore interactive... | Emerging |
| 10 | rpjayaraman/RTL2UVM | Automated UVM testbench generator from Verilog RTL with optional LLM... | Emerging |
| 11 | MigoXLab/LMeterX | A general-purpose API load testing platform that supports LLM services and... | Emerging |
| 12 | 599yongyang/DatasetLoom | An intelligent dataset construction and evaluation platform for multimodal large-model training. | Emerging |
| 13 | v7labs/benchllm | Continuous integration for LLM-powered applications. | Emerging |
| 14 | lean-dojo/LeanDojoWebsite | Code for LeanDojo's website. | Emerging |
| 15 | artas728/spelltest | AI-to-AI testing: a simulation framework for LLM-based applications. | Emerging |
| 16 | LudwigStumpp/llm-leaderboard | A joint community effort to create one central leaderboard for LLMs. | Emerging |
| 17 | Supahands/llm-comparison-backend | An open-source project for comparing two LLMs head to head... | Emerging |
| 18 | NikolasEnt/ollama-webui-intel | Ollama with Intel (i)GPU acceleration in Docker, plus benchmarks. | Emerging |
| 19 | vertbera/beyond-the-mirror | Field research exposing how LLM safeguards collapse under polite, persistent... | Emerging |
| 20 | Vvkmnn/awesome-ai-eval | ☑️ A curated list of tools, methods & platforms for evaluating AI... | Emerging |
| 21 | NOVADEDOG/energy-leaderboard-runner | Open-source energy benchmark for local LLMs. Measures Wh and CO2 using real... | Emerging |
| 22 | flashclub/ModelJudge | A multilingual AI model evaluation platform built with Next.js, supporting multi-model comparison and real-time streaming responses. | Emerging |
| 23 | empirical-run/empirical | Test and evaluate LLMs and model configurations across all the scenarios... | Emerging |
| 24 | sealambda/unit-text | Unit tests for plain text: LLM as a copy editor. | Emerging |
| 25 | nexmoe/lm-speed | Helps developers optimize AI application performance through comprehensive... | Emerging |
| 26 | amirdeljouyi/UTGen | Replication package for the ICSE 2025 paper "Leveraging Large Language... | Emerging |
| 27 | jpreagan/llmnop | A tool for measuring LLM performance metrics. | Emerging |
| 28 | dmeldrum6/LLM-Diff-Tool | Application for comparing responses from different Large Language Models... | Emerging |
| 29 | yinxulai/ait | Batch-tests performance metrics of AI models compatible with the OpenAI and Anthropic protocols. Supports... | Emerging |
| 30 | jonathanmli/Avalon-LLM | This repository contains an LLM benchmark for the social deduction game... | Emerging |
| 31 | LAVA-LAB/COOL-MC | The interface between probabilistic model checking and data-driven policy learning. | Emerging |
| 32 | geminimir/promptproof-action | Deterministic LLM contract checks for CI. Replays recorded fixtures,... | Emerging |
| 33 | jordicor/GranSabio_LLM | Multi-layer AI quality assurance for content generation. Multiple LLMs... | Emerging |
| 34 | lmg-anon/rp-test-framework | LLM roleplay test framework. | Emerging |
| 35 | georgeguimaraes/alike | Semantic similarity testing for LLM outputs, chatbots, and NLP in Elixir. | Emerging |
| 36 | stashlabs/duelr | Compare LLMs in one click. | Emerging |
| 37 | dsdanielpark/open-llm-leaderboard-report | Weekly visualization report of open LLM model performance based on 4 metrics. | Emerging |
| 38 | Supahands/llm-comparison | An open-source project for comparing two LLMs head to head... | Emerging |
| 39 | Skripkon/llm_trainer | 🤖 Train and evaluate LLMs with ease and fun 🦾 | Emerging |
| 40 | AGBAJEMUH/Awesome-AI-Evaluation-Guide | 🤖 Evaluate AI systems effectively with our comprehensive guide to methods,... | Experimental |
| 41 | Yuyz0112/relia | Find the best LLM for your needs through E2E testing. | Experimental |
| 42 | wafer-ai/chipbenchmark | A platform for monitoring the chip situation. | Experimental |
| 43 | josephpaulgiroux/ai_categories | Lets AI language models compete in a game of AI Categories (similar to... | Experimental |
| 44 | ArslanKAS/Quality-and-Safety-for-LLM-Applications | Explore new metrics and best practices to monitor your LLM systems and... | Experimental |
| 45 | ccarvalho-eng/aludel | LLM evaluation workbench. | Experimental |
| 46 | tianzhaotju/EMD | Replication package for "Large Language Models for Equivalent Mutant... | Experimental |
| 47 | LeonYang95/LLM4UT | Evaluation code for the ASE 2024 paper "On the Evaluation of LLM in Unit... | Experimental |
| 48 | shmercer/pairwiseLLM | R package: pairwise comparison tools for LLM-based writing evaluation. | Experimental |
| 49 | albertdobmeyer/cobol-legacy-ledger | Learn COBOL through a live banking system: 18 programs, 6-node settlement... | Experimental |
| 50 | kalilurrahman/QualityEngineeringBookByLLMs | Quality engineering book authored with LLM assistance, exploring modern QE... | Experimental |
| 51 | mich1803/Codenames-LLM | Building an AI team to play Codenames using top Large Language Models... | Experimental |
| 52 | ruankie/langfuse-monitoring-eval | Monitoring and evaluating LLM apps with Langfuse. Presented at PyConZA 2024. | Experimental |
| 53 | Amir-Mohseni/AI-Response-Evaluation | A comprehensive framework to evaluate the quality of AI-generated responses,... | Experimental |
| 54 | hongping-zh/ecocompute-ai | 🔋 RTX 5090 energy benchmark suite for LLMs: real NVML power data, not estimates. | Experimental |
| 55 | ellmos-ai/ellmos-tests | Testing framework for LLM operating systems (B/O/E test methodology). | Experimental |
| 56 | piyushgupta344/llm-test-harness | Deterministic testing framework for LLM-powered apps: record/replay... | Experimental |
| 57 | Templum/aoide | A TypeScript testing framework for LLM-powered applications. Write tests... | Experimental |
| 58 | augustocristian/llm-testing-roadmap-rp | Replication package for the article "A Research Roadmap on the Usage of... | Experimental |
| 59 | heyqule/evangelion_magi | An Evangelion MAGI-style decision system that links 3 LLM models. | Experimental |
| 60 | KooshaPari/kwality | 🧠 LLM validation platform: advanced testing frameworks with DeepEval,... | Experimental |
| 61 | sanand0/hypoforge | Use LLMs to analyze any dataset, create hypotheses from it, test the... | Experimental |
| 62 | athina-ai/athina-sdk | LLM testing SDK that helps you write and run tests to monitor your LLM app... | Experimental |
| 63 | ml-energy/leaderboard | How much time and energy do modern generative AI models consume? | Experimental |
| 64 | chiragpadyal/AutoTestGen | Automatic unit test generation suite using an LLM, as a Visual Studio... | Experimental |
| 65 | broskees/llm-compare | LLM benchmark comparison tool. | Experimental |
| 66 | waldekmastykarz/openai-compare | Compare the effectiveness of LLMs using OpenAI-compatible APIs. | Experimental |
| 67 | aiqualitylab/llm-qa-assistant | Compare and validate QA tasks using 3 local (Ollama) or cloud (Groq API)... | Experimental |
| 68 | RedKnight-aj/ai-testing-framework | AI testing framework using DeepEval: quality assurance for LLM applications. | Experimental |
| 69 | seeshuraj/llm-test-lab | 🧪 Evaluate, score, and compare LLM outputs before your users do. Automated... | Experimental |
| 70 | Maik425/promptdiff | Compare LLM outputs across models with one API call. Supports Claude, GPT, Gemini, and Grok. | Experimental |
| 71 | rololevy/debate-IA-politica-argentina | A debate between two fine-tuned LLMs. | Experimental |
| 72 | Strawhat404/wb77i-optimizing-high-throughput-chat-message-aggregation | A sample dataset for AI training to showcase the LLM benchmarking of... | Experimental |
| 73 | giis-uniovi/retorch-llm-rp | Replication package for LLM system testing experimentation. | Experimental |
| 74 | ceccon-t/LicLacMoe | Play tic-tac-toe against a local LLM. | Experimental |
| 75 | wyt2000/CodeV-SVA | [DAC 2026] QiMeng-CodeV-SVA: Training specialized LLMs for hardware... | Experimental |
| 76 | agent-sh/perf | Rigorous performance investigation workflow with baselines, profiling, and... | Experimental |
| 77 | sohambpatel/TestBedGenerator | Creating test beds with the help of ChatGPT, the in-house LLM Ollama, and... | Experimental |
| 78 | colingalbraith/Accoutre | Accoutre aims to equip SLMs with tools and measure the gains: a zero-build... | Experimental |
| 79 | AI4InclusiveDeliberation/inclusive_deliberation_llm | Empowering inclusive e-deliberation by harnessing collective wisdom and... | Experimental |
| 80 | SevdanurGENC/LLM-Based-Unit-Test-Generator | Automated unit test generation and evaluation using generative AI (GPT-4). | Experimental |
| 81 | Intelligent-CAT-Lab/FlakyDoctor | Artifact repository for the paper "Neurosymbolic Repair of Test Flakiness",... | Experimental |
| 82 | juandspy/llm-assert | Use any LLM to assert unit tests for you. | Experimental |