LLM Evaluation Frameworks · Prompt Engineering Tools

Systematic benchmarking and testing suites for evaluating LLM prompt strategies, output quality, consistency, and factuality across multiple models and tasks. Does NOT include prompt optimization tools, standalone hallucination-reduction techniques, or general LLM deployment platforms.

100 LLM evaluation framework tools are tracked. 2 score above 50 (the Established tier). The highest-rated is microsoft/promptbench at 64/100, with 2,785 stars.
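Scores run 0–100 and map onto three tiers: Established, Emerging, and Experimental. A minimal Python sketch of that mapping follows, under stated assumptions: the >50 cutoff for Established comes from the text above, but the 30-point boundary between Emerging and Experimental is only inferred from the table below and may not match the site's actual rule.

```python
def tier(score: int) -> str:
    """Map a 0-100 quality score to a tier label.

    The >50 Established cutoff is stated in the page text. The 30-point
    Emerging/Experimental boundary is inferred from the table (scores of
    31+ appear as Emerging, 29 and below as Experimental) and is an
    assumption, not a documented rule.
    """
    if score > 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"
```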

Get all 100 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=prompt-engineering&subcategory=llm-evaluation-frameworks&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
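The same request from Python, as a minimal sketch. Only the endpoint URL and the domain, subcategory, and limit query parameters come from the page; the shape of the JSON payload (a top-level list or an object wrapping a "projects" array) is an assumption, so inspect the response before relying on specific field names.

```python
import requests

# Query the documented endpoint; 100 requests/day are allowed without a key.
resp = requests.get(
    "https://pt-edge.onrender.com/api/v1/datasets/quality",
    params={
        "domain": "prompt-engineering",
        "subcategory": "llm-evaluation-frameworks",
        "limit": 100,  # fetch all 100 tracked projects
    },
    timeout=30,
)
resp.raise_for_status()
data = resp.json()

# Payload shape is an assumption: handle either a top-level list of records
# or an object that wraps them (the "projects" field name is a guess).
records = data if isinstance(data, list) else data.get("projects", [])
for record in records:
    print(record)
```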

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | microsoft/promptbench | A unified evaluation framework for large language models | 64 | Established |
| 2 | uptrain-ai/uptrain | UpTrain is an open-source unified platform to evaluate and improve... | 54 | Established |
| 3 | levitation-opensource/Manipulative-Expression-Recognition | MER is a software that identifies and highlights manipulative communication... | 45 | Emerging |
| 4 | microsoftarchive/promptbench | A unified evaluation framework for large language models | 45 | Emerging |
| 5 | gabe-mousa/Apolien | AI Safety Evaluation Library | 44 | Emerging |
| 6 | GSA/FedRAMP-OllaLab-Lean | The OllaLab-Lean project is designed to help both novice and experienced... | 41 | Emerging |
| 7 | PromptMixerDev/prompt-mixer-app-ce | A desktop application for comparing outputs from different Large Language... | 41 | Emerging |
| 8 | babelcloud/LLM-RGB | LLM Reasoning and Generation Benchmark. Evaluate LLMs in complex scenarios... | 41 | Emerging |
| 9 | ryoungj/ToolEmu | [ICLR'24 Spotlight] A language model (LM)-based emulation framework for... | 40 | Emerging |
| 10 | ozturkoktay/insurance-llm-framework | An interactive framework for experimenting with and evaluating open-source... | 37 | Emerging |
| 11 | syamsasi99/prompt-evaluator | prompt-evaluator is an open-source toolkit for evaluating, testing, and... | 34 | Emerging |
| 12 | fau-masters-collected-works-cgarbin/llm-comparison-tool | A tool to compare multiple large language models (LLMs) side by side | 33 | Emerging |
| 13 | realadeel/llm-test-bench | Compare LLM providers (OpenAI, Claude, Gemini) for vision tasks - benchmark... | 32 | Emerging |
| 14 | pablo-chacon/Spoon-Bending | Educational analysis of LLM alignment, safety behavior, and... | 31 | Emerging |
| 15 | deadbits/trs | 🔭 Threat report analysis via LLM and Vector DB | 29 | Experimental |
| 16 | paradite/eval-data | Prompts and evaluation data for LLMs on real world coding and writing tasks | 29 | Experimental |
| 17 | lpr021/redteam-ai-benchmark | 🧪 Evaluate uncensored LLMs for offensive security with targeted questions... | 26 | Experimental |
| 18 | thuanystuart/DD3412-chain-of-verification-reproduction | Re-implementation of the paper "Chain-of-Verification Reduces Hallucination... | 25 | Experimental |
| 19 | r4u-dev/open-r4u | Optimize AI & Maximize ROI of your LLM tasks. Evaluates current state and... | 24 | Experimental |
| 20 | EviAmarates/fresta-edge | Domain evaluation lens generator built on the Fresta Lens Framework | 24 | Experimental |
| 21 | LeNguyenAnhKhoa/Hallucination-Detection | Hallucination Detection using LLM's API | 24 | Experimental |
| 22 | sidoody/heart-context-pack | Compiling the HEART Score into a structured, model-facing policy artifact... | 23 | Experimental |
| 23 | SyntagmaNull/judgment-hygiene-stack | Tri-skill framework for structure routing, evidence discipline, and judgment... | 23 | Experimental |
| 24 | GnomeMan4201/drift-artifact | Stylometric drift experiment — documents that demonstrate iterative... | 23 | Experimental |
| 25 | petersimmons1972/brutal-evaluation | AI skill for brutally honest project feedback. Based on Dylan Davis's BRUTAL... | 22 | Experimental |
| 26 | Amir-ElBelawy/llm-failure-mode-taxonomy | A practitioner's taxonomy of recurring failure patterns in large language... | 22 | Experimental |
| 27 | IgnazioDS/evalops-workbench | A local-first evaluation harness for prompts, tools, and agents with... | 22 | Experimental |
| 28 | kepiCHelaSHen/context-hacking | Turn LLM priors into scientific rigor. Zero-drift multi-agent framework for... | 22 | Experimental |
| 29 | hideyuki001/unified-cognitive-os-v1.8 | Judgment decomposition architecture for translation QA, ASR review, AI... | 22 | Experimental |
| 30 | joshualamerton/Modelbench | Concept: benchmarking harness for prompts, models, and agent strategies | 22 | Experimental |
| 31 | Ravevx/LLM-Spatial-Reasoning-Evaluation-2D-Physics-Puzzle | A benchmark environment for evaluating large language models’ spatial... | 22 | Experimental |
| 32 | jameswniu/self-hosted-llm-evals-lab | Evaluation framework for self-hosted LLMs. Systematic prompt ablation... | 22 | Experimental |
| 33 | Chunduri-Aditya/Model-Behavior-Lab | Local Ollama-based LLM evaluation platform that benchmarks reasoning,... | 22 | Experimental |
| 34 | maxpetrusenko/llm-eval-notes | Public LLM evaluation artifacts: hallucination, brittleness, structured... | 22 | Experimental |
| 35 | kogunlowo123/ai-evaluation-prompts | Prompt evaluation framework with accuracy, coherence, safety rubrics, and... | 21 | Experimental |
| 36 | hirbis/prompt-governance | Replication package for "Prompt Governance in Financial AI" (Girolli, 2026)... | 21 | Experimental |
| 37 | michaelflppv/prompt-llm-benchmark | Prompt LLM Bench is a platform that discovers compatible Hugging Face models... | 21 | Experimental |
| 38 | tpertner/squeeze | Squeeze your model with pressure prompts to see if its behavior leaks. | 21 | Experimental |
| 39 | reiidoda/OpenRe | Open-source AI agent evaluation workbench for benchmarking, tracing,... | 21 | Experimental |
| 40 | vivek8849/llm-trust-evaluator | A production-ready framework for evaluating LLM reliability using semantic... | 21 | Experimental |
| 41 | AspenXDev/job-evaluation-engine | Modular prompt-engineered system for deterministic job evaluation with... | 21 | Experimental |
| 42 | kanupriya-GuptaM/llm-agreement-bias-benchmark | Benchmark framework for detecting agreement bias and answer instability in... | 21 | Experimental |
| 43 | firechair/AI-Engineering-Critique | 🚀 An interactive platform for LLM Preference Learning and Comparative... | 21 | Experimental |
| 44 | MVidicek/evalkit | Test your prompts like you test your code. Regression testing for LLM applications. | 21 | Experimental |
| 45 | mohosy/OpenEvals | Open-source eval studio for prompt comparisons, regression tracking, and... | 21 | Experimental |
| 46 | adityaarunsinghal/LLM-As-A-Judge-Prompt-Improver | Scientific framework for iterative LLM prompt improvement using... | 21 | Experimental |
| 47 | aaddii09/llm-eval-harness | 🔍 Run efficient evaluations for prompt and LLM regression testing with this... | 21 | Experimental |
| 48 | Philipnil06/ai-output-quality-lab | A structured experiment framework for prompt variation, evaluation, and... | 21 | Experimental |
| 49 | MarcKarbowiak/ai-evaluation-harness | Production-minded evaluation harness for LLM features with structured... | 21 | Experimental |
| 50 | aleremfer/prompt-eval-cases | Prompt comparison and evaluation across multiple LLMs (EN/ES) | 21 | Experimental |
| 51 | ktjkc/reflextrust | 🧠 LLMs don’t just process text — they read the room. Meaning emerges through... | 20 | Experimental |
| 52 | wzy6642/I3C-Select | Official implementation for "Instructing Large Language Models to Identify... | 20 | Experimental |
| 53 | GTMVP/modal-llm-evaluator | Run 1,000 LLM evaluations in 10 minutes. Test prompts across Claude, GPT-4,... | 20 | Experimental |
| 54 | kustonaut/llm-eval-kit | Quality scoring, eval suites, and regression detection for LLM outputs. | 19 | Experimental |
| 55 | chirindaopensource/llm_faithfulness_hallucination_misalignment_detection | End-to-End Python implementation of Semantic Divergence Metrics (SDM) for... | 18 | Experimental |
| 56 | noah-art3mis/crucible | Develop better LLM apps by testing different models and prompts in bulk. | 18 | Experimental |
| 57 | rlin25/FrizzlesRubric | A modular system for automated, multi-metric AI prompt evaluation—featuring... | 18 | Experimental |
| 58 | lkilefner/llm-quality-evaluation-examples | K–12 LLM evaluation examples using teacher-centered ground truths, rubrics,... | 17 | Experimental |
| 59 | FlosMume/LLM-Safety-Labs-Starter | Foundation for building safer generative-AI systems — includes example... | 17 | Experimental |
| 60 | aikenkyu001/semantic_roundtrip_benchmark_2 | This repository contains the primary contributions of our research paper, "A... | 17 | Experimental |
| 61 | moses-shenassa/llm-prompt-framework-and-eval-suite | Prompt engineering framework + evaluation harness for LLM workflows... | 17 | Experimental |
| 62 | vihanga/prompt-sandbox | Testing framework for LLM prompts. Started as a weekend project after... | 17 | Experimental |
| 63 | antzedek/dar-quickfix | Runtime patch that kills LLM loops, drift & hallucinations in real-time –... | 17 | Experimental |
| 64 | ghazaleh-mahmoodi/Prompting_LLMs_AS_Explainable_Metrics | Eval4NLP Shared Task on Prompting Large Language Models as Explainable Metrics | 17 | Experimental |
| 65 | aikenkyu001/benchmarking_llm_against_prompt_formats | Official experimental environment for 'Benchmarking LLM Sensitivity to... | 17 | Experimental |
| 66 | rahul-sg/HondaResearchLabs_DSC180A-Eval-Systems-Of-NextGen-LLMs | Domain-aware LLM summary evaluation and iterative refinement pipeline with... | 16 | Experimental |
| 67 | ghazal001/LLM-C-Grading-Agent | Ongoing LLM-based grading agent for automated evaluation of C++ programming... | 15 | Experimental |
| 68 | BOSSMAN-dev89/LBOS-LCAS-LP-Contradiction-tracker | A tool for auditing bias through large language models | 14 | Experimental |
| 69 | Ziechoes/reasoning-invariance-benchmark | Experiments testing whether LLM reasoning trajectories remain invariant when... | 14 | Experimental |
| 70 | AW-VB/llm-mcq-benchmark | Benchmarking open-weight LLMs on multiple-choice QA with prompt comparison,... | 14 | Experimental |
| 71 | OptionalSoftware/concurrent | The Multi-LLM Benchmarking Tool | 14 | Experimental |
| 72 | antsuebae/TFG-LLM-RE | TFG (bachelor's thesis): comparative evaluation of local vs. cloud LLMs in ... Engineering | 14 | Experimental |
| 73 | bensonbabu93/llm-prompt-evaluation-framework | A prompt experimentation tool that benchmarks LLM responses across multiple... | 14 | Experimental |
| 74 | YifanHe0126/medical-mllm-evaluation | Evaluation and model selection workflow for open-source multimodal LLMs in... | 14 | Experimental |
| 75 | illogical/LMEval | Web application for systematic prompt engineering and model evaluation | 14 | Experimental |
| 76 | gamzeakkurt/Prompt-Evaluation-in-AWS-Bedrock | Prompt evaluation framework using AWS Bedrock to assess LLM outputs with... | 14 | Experimental |
| 77 | jharter-stack/prompt-evals | prompt-evals — Prompt testing, comparisons, refinements, and failure cases | 14 | Experimental |
| 78 | rahulthadhani/llm-benchmark | A benchmark suite that tests how zero-shot, few-shot, chain-of-thought, and... | 14 | Experimental |
| 79 | rechriti/llm-risk-analysis | LLM-based risk analysis system using prompt engineering and evaluation (NDA-safe) | 14 | Experimental |
| 80 | Codegrammer999/prompt-bench | This is a benchmark suite comparing zero-shot, few-shot, Chain-of-Thought,... | 14 | Experimental |
| 81 | sergeyklay/factly | CLI tool to evaluate LLM factuality on the MMLU benchmark. | 13 | Experimental |
| 82 | motasemwed/llm-judge | LLM-as-a-Judge system for rubric-based, explainable evaluation of large... | 13 | Experimental |
| 83 | TheSkyBiz/llm-persona-drift-evaluation | 945-generation adversarial evaluation of 3 open LLMs across 3 personas and... | 13 | Experimental |
| 84 | eugeniusms/TextualVerifier | LLM-Based Textual Verifier using Chain-of-Thought, Variant Generation, and... | 13 | Experimental |
| 85 | Laksh-star/ai-fluency-gym | Educational AI fluency self-assessment inspired by the 4D framework, with... | 13 | Experimental |
| 86 | 1rajatk/content-judgment-calibrator | A judgment calibration framework for auditing content clarity, credibility,... | 13 | Experimental |
| 87 | sportixIndia/LBOS-LCAS-LP-Contradiction-tracker | 🔍 Track contradictions in AI and human content with LBOS-LCAS, enhancing... | 13 | Experimental |
| 88 | YaswanthGhanta/llm-logical-integrity-benchmark | Adversarial testing of LLMs on constraint satisfaction deadlocks | 13 | Experimental |
| 89 | yuchenzhu-research/iclr2026-cao-prompt-drift-lab | A reproducible evaluation framework for studying how small prompt variations... | 13 | Experimental |
| 90 | GoodCODER280722/llm-output-validator | Rule-based AI output validation CLI tool (mock mode) with structured JSON reporting. | 13 | Experimental |
| 91 | mtchynkstff/llm-ed-eval | A reproducible evaluation framework analyzing how prompt strategies affect... | 13 | Experimental |
| 92 | jadhav045/DeepStack-AILM-Assignment | A strict, provider-agnostic User Input Validator powered exclusively by LLMs... | 13 | Experimental |
| 93 | SiemonCha/ECM3401-LLM-Essay-Scoring | Measuring semantic robustness in LLM-based CEFR essay scoring through... | 13 | Experimental |
| 94 | KSVQ/openrouter-harness | Lightweight OpenRouter evaluation harness with web UI, batch runs, and a... | 13 | Experimental |
| 95 | BrewLLM/brewval | Evaluate prompts for LLM applications | 12 | Experimental |
| 96 | Wind-Flipped/DyG-Hallucination | Cocktail dynamic graph prompting technique in LLM for hallucination | 11 | Experimental |
| 97 | jwalsh/amazon-bedrock-prompt-evaluation | Evaluating prompts at scale with Prompt Management and Prompt Flows for... | 11 | Experimental |
| 98 | ajdedeaux/ai-eval-framework | Systematic AI evaluation framework that transforms subjective assessment... | 10 | Experimental |
| 99 | meghajbhat/Reducing-Hallucinations-in-LLMs-using-Prompt-Engineering-Strategies | A comprehensive study on reducing hallucinations in Large Language Models... | 10 | Experimental |
| 100 | gdorleon/balm_llm | BALM: Bias-Aware Language Model with inference-time bias detection and correction. | 10 | Experimental |
