LLM Comparison & Evaluation Tools

Tools for comparing LLM outputs, benchmarking performance across multiple models, and evaluating LLM quality on specific tasks. Does NOT include general LLM evaluation frameworks, prompt engineering resources, or single-model testing tools.

There are 82 LLM comparison and evaluation tools tracked. One scores above 70 (the Verified tier). The highest-rated is open-compass/opencompass at 73/100 with 6,752 stars. One of the top 10 projects is actively maintained.

Fetch the project data as JSON:

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-comparison-evaluation&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
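The same request can be built programmatically. A minimal Python sketch, assuming only the endpoint and query parameters shown in the curl example above (the response schema is not documented here, so this only constructs the URL):

```python
from urllib.parse import urlencode

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_quality_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Build the dataset query URL matching the documented curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE}?{query}"

url = build_quality_url("llm-tools", "llm-comparison-evaluation")
print(url)
# → https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-comparison-evaluation&limit=20
```

Pass the result to any HTTP client; the `limit` parameter can presumably be raised to cover all 82 entries, though the API's maximum is not stated here.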

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | open-compass/opencompass | OpenCompass is an LLM evaluation platform, supporting a wide range of models... | 73 | Verified |
| 2 | IBM/unitxt | 🦄 Unitxt is a Python library for enterprise-grade evaluation of AI... | 69 | Established |
| 3 | lean-dojo/LeanDojo | Tool for data extraction and interacting with Lean programmatically. | 57 | Established |
| 4 | GoodStartLabs/AI_Diplomacy | Frontier models playing the board game Diplomacy. | 56 | Established |
| 5 | google/litmus | Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI... | 50 | Established |
| 6 | salesforce/CodeT5 | Home of CodeT5: open code LLMs for code understanding and generation. | 49 | Emerging |
| 7 | NatLabRockies/COMPASS | INFRA-COMPASS is a tool that leverages Large Language Models (LLMs) to... | 49 | Emerging |
| 8 | JonathanChavezTamales/llm-leaderboard | A comprehensive set of LLM benchmark scores and provider prices... | 49 | Emerging |
| 9 | namin/dafny-sketcher | Piggybacking on the Dafny language implementation to explore interactive... | 48 | Emerging |
| 10 | rpjayaraman/RTL2UVM | Automated UVM testbench generator from Verilog RTL with optional LLM... | 47 | Emerging |
| 11 | MigoXLab/LMeterX | A general-purpose API load testing platform that supports LLM services and... | 47 | Emerging |
| 12 | 599yongyang/DatasetLoom | An intelligent dataset construction and evaluation platform for multimodal large-model training. | 46 | Emerging |
| 13 | v7labs/benchllm | Continuous integration for LLM-powered applications. | 46 | Emerging |
| 14 | lean-dojo/LeanDojoWebsite | Code for LeanDojo's website. | 44 | Emerging |
| 15 | artas728/spelltest | AI-to-AI testing: a simulation framework for LLM-based applications. | 44 | Emerging |
| 16 | LudwigStumpp/llm-leaderboard | A joint community effort to create one central leaderboard for LLMs. | 43 | Emerging |
| 17 | Supahands/llm-comparison-backend | An open-source project allowing you to compare two LLMs head to head... | 43 | Emerging |
| 18 | NikolasEnt/ollama-webui-intel | Ollama with Intel (i)GPU acceleration in Docker, with benchmarks. | 42 | Emerging |
| 19 | vertbera/beyond-the-mirror | Field research exposing how LLM safeguards collapse under polite, persistent... | 42 | Emerging |
| 20 | Vvkmnn/awesome-ai-eval | ☑️ A curated list of tools, methods & platforms for evaluating AI... | 42 | Emerging |
| 21 | NOVADEDOG/energy-leaderboard-runner | Open-source energy benchmark for local LLMs. Measures Wh and CO2 using real... | 41 | Emerging |
| 22 | flashclub/ModelJudge | A multilingual AI model evaluation platform built with Next.js, supporting multi-model comparison and real-time streaming responses. | 39 | Emerging |
| 23 | empirical-run/empirical | Test and evaluate LLMs and model configurations, across all the scenarios... | 38 | Emerging |
| 24 | sealambda/unit-text | Unit tests for plain text: LLM as a copy editor. | 37 | Emerging |
| 25 | nexmoe/lm-speed | Help developers optimize AI application performance through comprehensive... | 37 | Emerging |
| 26 | amirdeljouyi/UTGen | Replication package of the ICSE 2025 paper "Leveraging Large Language... | 35 | Emerging |
| 27 | jpreagan/llmnop | A tool for measuring LLM performance metrics. | 35 | Emerging |
| 28 | dmeldrum6/LLM-Diff-Tool | Application for comparing responses from different Large Language Models... | 35 | Emerging |
| 29 | yinxulai/ait | Batch-tests performance metrics of AI models compatible with the OpenAI and Anthropic protocols. Supports... | 34 | Emerging |
| 30 | jonathanmli/Avalon-LLM | This repository contains an LLM benchmark for the social deduction game... | 33 | Emerging |
| 31 | LAVA-LAB/COOL-MC | The interface between probabilistic model checking and data-driven policy learning. | 33 | Emerging |
| 32 | geminimir/promptproof-action | Deterministic LLM contract checks for CI. Replays recorded fixtures,... | 33 | Emerging |
| 33 | jordicor/GranSabio_LLM | Multi-layer AI quality assurance for content generation. Multiple LLMs... | 33 | Emerging |
| 34 | lmg-anon/rp-test-framework | LLM roleplay test framework. | 32 | Emerging |
| 35 | georgeguimaraes/alike | Semantic similarity testing for Elixir. Test LLM outputs, chatbots, and NLP in Elixir. | 32 | Emerging |
| 36 | stashlabs/duelr | Compare LLMs in one click. | 32 | Emerging |
| 37 | dsdanielpark/open-llm-leaderboard-report | Weekly visualization report of open LLM model performance based on 4 metrics. | 31 | Emerging |
| 38 | Supahands/llm-comparison | An open-source project allowing you to compare two LLMs head to head... | 30 | Emerging |
| 39 | Skripkon/llm_trainer | 🤖 Train and evaluate LLMs with ease and fun 🦾 | 30 | Emerging |
| 40 | AGBAJEMUH/Awesome-AI-Evaluation-Guide | 🤖 Evaluate AI systems effectively with our comprehensive guide to methods,... | 29 | Experimental |
| 41 | Yuyz0112/relia | Find the best LLM for your needs through end-to-end testing. | 29 | Experimental |
| 42 | wafer-ai/chipbenchmark | A platform for monitoring the chip situation. | 29 | Experimental |
| 43 | josephpaulgiroux/ai_categories | Lets AI language models compete in a game of AI Categories (similar to... | 28 | Experimental |
| 44 | ArslanKAS/Quality-and-Safety-for-LLM-Applications | Explore new metrics and best practices to monitor your LLM systems and... | 28 | Experimental |
| 45 | ccarvalho-eng/aludel | LLM evaluation workbench. | 27 | Experimental |
| 46 | tianzhaotju/EMD | Replication package for "Large Language Models for Equivalent Mutant... | 27 | Experimental |
| 47 | LeonYang95/LLM4UT | Evaluation code of the ASE 2024 paper "On the Evaluation of LLM in Unit... | 27 | Experimental |
| 48 | shmercer/pairwiseLLM | R package: pairwise comparison tools for LLM-based writing evaluation. | 26 | Experimental |
| 49 | albertdobmeyer/cobol-legacy-ledger | Learn COBOL through a live banking system: 18 programs, 6-node settlement... | 26 | Experimental |
| 50 | kalilurrahman/QualityEngineeringBookByLLMs | Quality engineering book authored with LLM assistance, exploring modern QE... | 24 | Experimental |
| 51 | mich1803/Codenames-LLM | Building an AI team to play Codenames using top large language models... | 23 | Experimental |
| 52 | ruankie/langfuse-monitoring-eval | Monitoring and evaluating LLM apps with Langfuse. Presented at PyConZA 2024. | 23 | Experimental |
| 53 | Amir-Mohseni/AI-Response-Evaluation | A comprehensive framework to evaluate the quality of AI-generated responses,... | 23 | Experimental |
| 54 | hongping-zh/ecocompute-ai | 🔋 RTX 5090 energy benchmark suite for LLMs: real NVML power data, not estimates. | 23 | Experimental |
| 55 | ellmos-ai/ellmos-tests | Testing framework for LLM operating systems (B/O/E test methodology). | 22 | Experimental |
| 56 | piyushgupta344/llm-test-harness | Deterministic testing framework for LLM-powered apps: record/replay... | 22 | Experimental |
| 57 | Templum/aoide | A TypeScript testing framework for LLM-powered applications. Write tests... | 22 | Experimental |
| 58 | augustocristian/llm-testing-roadmap-rp | Replication package of the article "A Research Roadmap on the Usage of... | 22 | Experimental |
| 59 | heyqule/evangelion_magi | Evangelion MAGI decision system that links 3 LLM models. | 22 | Experimental |
| 60 | KooshaPari/kwality | 🧠 LLM validation platform: advanced testing frameworks with DeepEval,... | 21 | Experimental |
| 61 | sanand0/hypoforge | Use LLMs to analyze any dataset, create hypotheses from those, test the... | 20 | Experimental |
| 62 | athina-ai/athina-sdk | LLM testing SDK that helps you write and run tests to monitor your LLM app... | 20 | Experimental |
| 63 | ml-energy/leaderboard | How much time and energy do modern generative AI models consume? | 19 | Experimental |
| 64 | chiragpadyal/AutoTestGen | Automatic unit test generation suite using an LLM as a Visual Studio... | 19 | Experimental |
| 65 | broskees/llm-compare | LLM benchmark comparison tool. | 18 | Experimental |
| 66 | waldekmastykarz/openai-compare | Compare the effectiveness of LLMs using OpenAI-compatible APIs. | 18 | Experimental |
| 67 | aiqualitylab/llm-qa-assistant | Compare and validate QA tasks using 3 local (Ollama) or cloud (Groq API)... | 18 | Experimental |
| 68 | RedKnight-aj/ai-testing-framework | AI testing framework using DeepEval: quality assurance for LLM applications. | 14 | Experimental |
| 69 | seeshuraj/llm-test-lab | 🧪 Evaluate, score, and compare LLM outputs before your users do. Automated... | 14 | Experimental |
| 70 | Maik425/promptdiff | Compare LLM outputs across models. One API call. Supports Claude, GPT, Gemini, Grok. | 14 | Experimental |
| 71 | rololevy/debate-IA-politica-argentina | A debate between two fine-tuned LLMs. | 14 | Experimental |
| 72 | Strawhat404/wb77i-optimizing-high-throughput-chat-message-aggregation | A sample dataset for AI training to showcase the LLM benchmarking of... | 13 | Experimental |
| 73 | giis-uniovi/retorch-llm-rp | Replication package for LLM system testing experimentation. | 13 | Experimental |
| 74 | ceccon-t/LicLacMoe | Play tic-tac-toe against a local LLM model. | 13 | Experimental |
| 75 | wyt2000/CodeV-SVA | [DAC 2026] QiMeng-CodeV-SVA: Training Specialized LLMs for Hardware... | 13 | Experimental |
| 76 | agent-sh/perf | Rigorous performance investigation workflow with baselines, profiling, and... | 13 | Experimental |
| 77 | sohambpatel/TestBedGenerator | Creating test beds with the help of ChatGPT, the in-house LLM Ollama, and... | 13 | Experimental |
| 78 | colingalbraith/Accoutre | Accoutre aims to equip SLMs with tools and measure the gains: a zero-build... | 13 | Experimental |
| 79 | AI4InclusiveDeliberation/inclusive_deliberation_llm | Empowering inclusive e-deliberation by harnessing collective wisdom and... | 13 | Experimental |
| 80 | SevdanurGENC/LLM-Based-Unit-Test-Generator | Automated unit test generation and evaluation using generative AI (GPT-4). | 13 | Experimental |
| 81 | Intelligent-CAT-Lab/FlakyDoctor | Artifact repository for the paper "Neurosymbolic Repair of Test Flakiness",... | 12 | Experimental |
| 82 | juandspy/llm-assert | Use any LLM to assert unit tests for you. | 11 | Experimental |
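The tier labels track the score bands visible in the listing above: 70 and above is Verified, 50–69 is Established, 30–49 is Emerging, and below 30 is Experimental. A minimal sketch of that mapping (thresholds inferred from the listed scores, not from any documented API contract):

```python
def tier_for_score(score: int) -> str:
    """Map a 0-100 quality score to its tier, using bands inferred from the listing."""
    if score >= 70:
        return "Verified"
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"

# Spot-check against rows from the listing above.
print(tier_for_score(73))  # open-compass/opencompass → Verified
print(tier_for_score(69))  # IBM/unitxt → Established
print(tier_for_score(30))  # Skripkon/llm_trainer → Emerging
print(tier_for_score(29))  # AGBAJEMUH/Awesome-AI-Evaluation-Guide → Experimental
```

The boundary cases (69 vs. 70, 49 vs. 50, 29 vs. 30) all agree with the listed data, which is what pins the thresholds down.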