LLM Comparison & Evaluation Tools
Tools for comparing LLM outputs, benchmarking performance across multiple models, and evaluating LLM quality on specific tasks. Does NOT include general LLM evaluation frameworks, prompt engineering resources, or single-model testing tools.
There are 82 LLM comparison and evaluation tools tracked. One scores above 70 (Verified tier): the highest-rated, open-compass/opencompass, at 73/100 with 6,752 stars. One of the top 10 is actively maintained.
Get all 82 projects as JSON (note `limit=82` to fetch the full set):

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-comparison-evaluation&limit=82"
```

Open to everyone: 100 requests/day with no key; a free key raises the limit to 1,000/day.
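For programmatic use, the endpoint can be consumed with a few lines of Python. A minimal sketch: the base URL and query parameters come from the curl command above, but the response schema is not documented here, so the `name`/`score`/`tier` fields and the inline sample payload are illustrative assumptions (only opencompass's 73/Verified is stated in this listing; the second entry's null score is a placeholder, not real data).

```python
import json
import urllib.parse

# Endpoint and query parameters as shown in the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {"domain": "llm-tools",
          "subcategory": "llm-comparison-evaluation",
          "limit": 82}
url = BASE + "?" + urllib.parse.urlencode(params)
# Fetch with e.g. urllib.request.urlopen(url) -- omitted here to keep
# the sketch offline-runnable.

# Hypothetical response shape; the real API schema may differ.
sample = json.loads("""
[
  {"name": "open-compass/opencompass", "score": 73, "tier": "Verified"},
  {"name": "IBM/unitxt", "score": null, "tier": "Established"}
]
""")

def names_in_tier(projects, tier):
    """Return the names of projects in the given quality tier."""
    return [p["name"] for p in projects if p["tier"] == tier]

print(names_in_tier(sample, "Verified"))
```

Filtering locally by tier keeps the script to a single API call per day, which matters under the 100 requests/day anonymous quota.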
| # | Tool | Description | Tier |
|---|------|-------------|------|
| 1 | open-compass/opencompass | OpenCompass is an LLM evaluation platform, supporting a wide range of models... | Verified |
| 2 | IBM/unitxt | 🦄 Unitxt is a Python library for enterprise-grade evaluation of AI... | Established |
| 3 | lean-dojo/LeanDojo | Tool for data extraction and interacting with Lean programmatically. | Established |
| 4 | GoodStartLabs/AI_Diplomacy | Frontier models playing the board game Diplomacy. | Established |
| 5 | google/litmus | Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI... | Established |
| 6 | salesforce/CodeT5 | Home of CodeT5: Open Code LLMs for Code Understanding and Generation. | Emerging |
| 7 | NatLabRockies/COMPASS | INFRA-COMPASS is a tool that leverages Large Language Models (LLMs) to... | Emerging |
| 8 | JonathanChavezTamales/llm-leaderboard | A comprehensive set of LLM benchmark scores and provider prices... | Emerging |
| 9 | namin/dafny-sketcher | Piggybacking on the Dafny language implementation to explore interactive... | Emerging |
| 10 | rpjayaraman/RTL2UVM | Automated UVM testbench generator from Verilog RTL with optional LLM... | Emerging |
| 11 | MigoXLab/LMeterX | A general-purpose API load testing platform that supports LLM services and... | Emerging |
| 12 | 599yongyang/DatasetLoom | An intelligent dataset construction and evaluation platform for multimodal large-model training. | Emerging |
| 13 | v7labs/benchllm | Continuous integration for LLM-powered applications. | Emerging |
| 14 | lean-dojo/LeanDojoWebsite | Code for LeanDojo's website. | Emerging |
| 15 | artas728/spelltest | AI-to-AI testing: a simulation framework for LLM-based applications. | Emerging |
| 16 | LudwigStumpp/llm-leaderboard | A joint community effort to create one central leaderboard for LLMs. | Emerging |
| 17 | Supahands/llm-comparison-backend | An open-source project for comparing two LLMs head to head... | Emerging |
| 18 | NikolasEnt/ollama-webui-intel | Ollama with Intel (i)GPU acceleration in Docker, plus benchmarks. | Emerging |
| 19 | vertbera/beyond-the-mirror | Field research exposing how LLM safeguards collapse under polite, persistent... | Emerging |
| 20 | Vvkmnn/awesome-ai-eval | ☑️ A curated list of tools, methods & platforms for evaluating AI... | Emerging |
| 21 | NOVADEDOG/energy-leaderboard-runner | Open-source energy benchmark for local LLMs. Measures Wh and CO2 using real... | Emerging |
| 22 | flashclub/ModelJudge | A multilingual AI model evaluation platform built with Next.js, supporting multi-model comparison and real-time streaming responses. | Emerging |
| 23 | empirical-run/empirical | Test and evaluate LLMs and model configurations across all the scenarios... | Emerging |
| 24 | sealambda/unit-text | Unit tests for plain text: LLM as a copy editor. | Emerging |
| 25 | nexmoe/lm-speed | Helps developers optimize AI application performance through comprehensive... | Emerging |
| 26 | amirdeljouyi/UTGen | Replication package for the ICSE 2025 paper "Leveraging Large Language... | Emerging |
| 27 | jpreagan/llmnop | A tool for measuring LLM performance metrics. | Emerging |
| 28 | dmeldrum6/LLM-Diff-Tool | Application for comparing responses from different Large Language Models... | Emerging |
| 29 | yinxulai/ait | Batch-tests performance metrics of AI models compatible with the OpenAI and Anthropic protocols. Supports... | Emerging |
| 30 | jonathanmli/Avalon-LLM | This repository contains an LLM benchmark for the social deduction game... | Emerging |
| 31 | LAVA-LAB/COOL-MC | The interface between probabilistic model checking and data-driven policy learning. | Emerging |
| 32 | geminimir/promptproof-action | Deterministic LLM contract checks for CI. Replays recorded fixtures,... | Emerging |
| 33 | jordicor/GranSabio_LLM | Multi-layer AI quality assurance for content generation. Multiple LLMs... | Emerging |
| 34 | lmg-anon/rp-test-framework | LLM roleplay test framework. | Emerging |
| 35 | georgeguimaraes/alike | Semantic similarity testing for LLM outputs, chatbots, and NLP in Elixir. | Emerging |
| 36 | stashlabs/duelr | Compare LLMs in one click. | Emerging |
| 37 | dsdanielpark/open-llm-leaderboard-report | Weekly visualization report of open LLM model performance based on 4 metrics. | Emerging |
| 38 | Supahands/llm-comparison | An open-source project for comparing two LLMs head to head... | Emerging |
| 39 | Skripkon/llm_trainer | 🤖 Train and evaluate LLMs with ease and fun 🦾 | Emerging |
| 40 | AGBAJEMUH/Awesome-AI-Evaluation-Guide | 🤖 Evaluate AI systems effectively with our comprehensive guide to methods,... | Experimental |
| 41 | Yuyz0112/relia | Find the best LLM for your needs through E2E testing. | Experimental |
| 42 | wafer-ai/chipbenchmark | A platform for monitoring the chip situation. | Experimental |
| 43 | josephpaulgiroux/ai_categories | Lets AI language models compete in a game of AI Categories (similar to... | Experimental |
| 44 | ArslanKAS/Quality-and-Safety-for-LLM-Applications | Explore new metrics and best practices to monitor your LLM systems and... | Experimental |
| 45 | ccarvalho-eng/aludel | LLM evaluation workbench. | Experimental |
| 46 | tianzhaotju/EMD | Replication package for "Large Language Models for Equivalent Mutant... | Experimental |
| 47 | LeonYang95/LLM4UT | Evaluation code for the ASE 2024 paper "On the Evaluation of LLM in Unit... | Experimental |
| 48 | shmercer/pairwiseLLM | R package: pairwise comparison tools for LLM-based writing evaluation. | Experimental |
| 49 | albertdobmeyer/cobol-legacy-ledger | Learn COBOL through a live banking system: 18 programs, 6-node settlement... | Experimental |
| 50 | kalilurrahman/QualityEngineeringBookByLLMs | Quality engineering book authored with LLM assistance, exploring modern QE... | Experimental |
| 51 | mich1803/Codenames-LLM | Building an AI team to play Codenames using top Large Language Models... | Experimental |
| 52 | ruankie/langfuse-monitoring-eval | Monitoring and evaluating LLM apps with Langfuse. Presented at PyConZA 2024. | Experimental |
| 53 | Amir-Mohseni/AI-Response-Evaluation | A comprehensive framework to evaluate the quality of AI-generated responses,... | Experimental |
| 54 | hongping-zh/ecocompute-ai | 🔋 RTX 5090 energy benchmark suite for LLMs: real NVML power data, not estimates. | Experimental |
| 55 | ellmos-ai/ellmos-tests | Testing framework for LLM operating systems (B/O/E test methodology). | Experimental |
| 56 | piyushgupta344/llm-test-harness | Deterministic testing framework for LLM-powered apps: record/replay... | Experimental |
| 57 | Templum/aoide | A TypeScript testing framework for LLM-powered applications. Write tests... | Experimental |
| 58 | augustocristian/llm-testing-roadmap-rp | Replication package for the article "A Research Roadmap on the Usage of... | Experimental |
| 59 | heyqule/evangelion_magi | An Evangelion MAGI-style decision system that links 3 LLM models. | Experimental |
| 60 | KooshaPari/kwality | 🧠 LLM validation platform: advanced testing frameworks with DeepEval,... | Experimental |
| 61 | sanand0/hypoforge | Use LLMs to analyze any dataset, create hypotheses from it, test the... | Experimental |
| 62 | athina-ai/athina-sdk | LLM testing SDK that helps you write and run tests to monitor your LLM app... | Experimental |
| 63 | ml-energy/leaderboard | How much time and energy do modern generative AI models consume? | Experimental |
| 64 | chiragpadyal/AutoTestGen | Automatic unit test generation suite using an LLM, as a Visual Studio... | Experimental |
| 65 | broskees/llm-compare | LLM benchmark comparison tool. | Experimental |
| 66 | waldekmastykarz/openai-compare | Compare the effectiveness of LLMs using OpenAI-compatible APIs. | Experimental |
| 67 | aiqualitylab/llm-qa-assistant | Compare and validate QA tasks using 3 local (Ollama) or cloud (Groq API)... | Experimental |
| 68 | RedKnight-aj/ai-testing-framework | AI testing framework using DeepEval: quality assurance for LLM applications. | Experimental |
| 69 | seeshuraj/llm-test-lab | 🧪 Evaluate, score, and compare LLM outputs before your users do. Automated... | Experimental |
| 70 | Maik425/promptdiff | Compare LLM outputs across models with one API call. Supports Claude, GPT, Gemini, and Grok. | Experimental |
| 71 | rololevy/debate-IA-politica-argentina | A debate between two fine-tuned LLMs. | Experimental |
| 72 | Strawhat404/wb77i-optimizing-high-throughput-chat-message-aggregation | A sample dataset for AI training to showcase the LLM benchmarking of... | Experimental |
| 73 | giis-uniovi/retorch-llm-rp | Replication package for LLM system testing experimentation. | Experimental |
| 74 | ceccon-t/LicLacMoe | Play tic-tac-toe against a local LLM. | Experimental |
| 75 | wyt2000/CodeV-SVA | [DAC 2026] QiMeng-CodeV-SVA: Training specialized LLMs for hardware... | Experimental |
| 76 | agent-sh/perf | Rigorous performance investigation workflow with baselines, profiling, and... | Experimental |
| 77 | sohambpatel/TestBedGenerator | Creating test beds with the help of ChatGPT, the in-house LLM Ollama, and... | Experimental |
| 78 | colingalbraith/Accoutre | Accoutre aims to equip SLMs with tools and measure the gains: a zero-build... | Experimental |
| 79 | AI4InclusiveDeliberation/inclusive_deliberation_llm | Empowering inclusive e-deliberation by harnessing collective wisdom and... | Experimental |
| 80 | SevdanurGENC/LLM-Based-Unit-Test-Generator | Automated unit test generation and evaluation using generative AI (GPT-4). | Experimental |
| 81 | Intelligent-CAT-Lab/FlakyDoctor | Artifact repository for the paper "Neurosymbolic Repair of Test Flakiness",... | Experimental |
| 82 | juandspy/llm-assert | Use any LLM to assert unit tests for you. | Experimental |