LLM Evaluation Frameworks (Prompt Engineering Tools)
Systematic benchmarking and testing suites for evaluating LLM prompt strategies, output quality, consistency, and factuality across multiple models and tasks. Does NOT include prompt optimization tools, hallucination-reduction techniques alone, or general LLM deployment platforms.
100 LLM evaluation framework tools are tracked; 2 score above 50, placing them in the established tier. The highest-rated is microsoft/promptbench at 64/100, with 2,785 stars.
Get all 100 projects as JSON (the example below requests the top 20 via `limit=20`; raise `limit` to retrieve the full list):

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=prompt-engineering&subcategory=llm-evaluation-frameworks&limit=20"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
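For programmatic use, here is a minimal Python sketch of the same request. It assumes the endpoint returns a JSON body whose project records carry `name`, `score`, and `tier` fields; the field names and the `data` wrapper are guesses, so inspect a raw response before relying on them:

```python
import json
import urllib.parse
import urllib.request

# Endpoint and query parameters taken verbatim from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "prompt-engineering",
    "subcategory": "llm-evaluation-frameworks",
    "limit": 100,  # assumption: the API accepts limits above the example's 20
}

url = f"{BASE}?{urllib.parse.urlencode(params)}"
with urllib.request.urlopen(url, timeout=30) as resp:
    payload = json.load(resp)

# Assumption: records sit at the top level or under a "data" key.
records = payload if isinstance(payload, list) else payload.get("data", [])

# Print score, tier, and name, highest score first; missing fields fall back to "?".
for rec in sorted(records, key=lambda r: r.get("score") or 0, reverse=True):
    print(f"{rec.get('score', '?'):>5}  {rec.get('tier', '?'):<12}  {rec.get('name', '?')}")
```

At the anonymous tier of 100 requests/day, a single call with a high `limit` is the simplest way to pull the whole list.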
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | microsoft/promptbench | A unified evaluation framework for large language models | 64 | Established |
| 2 | uptrain-ai/uptrain | UpTrain is an open-source unified platform to evaluate and improve... | | Established |
| 3 | levitation-opensource/Manipulative-Expression-Recognition | MER is a software that identifies and highlights manipulative communication... | | Emerging |
| 4 | microsoftarchive/promptbench | A unified evaluation framework for large language models | | Emerging |
| 5 | gabe-mousa/Apolien | AI Safety Evaluation Library | | Emerging |
| 6 | GSA/FedRAMP-OllaLab-Lean | The OllaLab-Lean project is designed to help both novice and experienced... | | Emerging |
| 7 | PromptMixerDev/prompt-mixer-app-ce | A desktop application for comparing outputs from different Large Language... | | Emerging |
| 8 | babelcloud/LLM-RGB | LLM Reasoning and Generation Benchmark. Evaluate LLMs in complex scenarios... | | Emerging |
| 9 | ryoungj/ToolEmu | [ICLR'24 Spotlight] A language model (LM)-based emulation framework for... | | Emerging |
| 10 | ozturkoktay/insurance-llm-framework | An interactive framework for experimenting with and evaluating open-source... | | Emerging |
| 11 | syamsasi99/prompt-evaluator | prompt-evaluator is an open-source toolkit for evaluating, testing, and... | | Emerging |
| 12 | fau-masters-collected-works-cgarbin/llm-comparison-tool | A tool to compare multiple large language models (LLMs) side by side | | Emerging |
| 13 | realadeel/llm-test-bench | Compare LLM providers (OpenAI, Claude, Gemini) for vision tasks - benchmark... | | Emerging |
| 14 | pablo-chacon/Spoon-Bending | Educational analysis of LLM alignment, safety behavior, and... | | Emerging |
| 15 | deadbits/trs | 🔍 Threat report analysis via LLM and Vector DB | | Experimental |
| 16 | paradite/eval-data | Prompts and evaluation data for LLMs on real world coding and writing tasks | | Experimental |
| 17 | lpr021/redteam-ai-benchmark | 🧪 Evaluate uncensored LLMs for offensive security with targeted questions... | | Experimental |
| 18 | thuanystuart/DD3412-chain-of-verification-reproduction | Re-implementation of the paper "Chain-of-Verification Reduces Hallucination... | | Experimental |
| 19 | r4u-dev/open-r4u | Optimize AI & Maximize ROI of your LLM tasks. Evaluates current state and... | | Experimental |
| 20 | EviAmarates/fresta-edge | Domain evaluation lens generator built on the Fresta Lens Framework | | Experimental |
| 21 | LeNguyenAnhKhoa/Hallucination-Detection | Hallucination Detection using LLM's API | | Experimental |
| 22 | sidoody/heart-context-pack | Compiling the HEART Score into a structured, model-facing policy artifact... | | Experimental |
| 23 | SyntagmaNull/judgment-hygiene-stack | Tri-skill framework for structure routing, evidence discipline, and judgment... | | Experimental |
| 24 | GnomeMan4201/drift-artifact | Stylometric drift experiment — documents that demonstrate iterative... | | Experimental |
| 25 | petersimmons1972/brutal-evaluation | AI skill for brutally honest project feedback. Based on Dylan Davis's BRUTAL... | | Experimental |
| 26 | Amir-ElBelawy/llm-failure-mode-taxonomy | A practitioner's taxonomy of recurring failure patterns in large language... | | Experimental |
| 27 | IgnazioDS/evalops-workbench | A local-first evaluation harness for prompts, tools, and agents with... | | Experimental |
| 28 | kepiCHelaSHen/context-hacking | Turn LLM priors into scientific rigor. Zero-drift multi-agent framework for... | | Experimental |
| 29 | hideyuki001/unified-cognitive-os-v1.8 | Judgment decomposition architecture for translation QA, ASR review, AI... | | Experimental |
| 30 | joshualamerton/Modelbench | Concept: benchmarking harness for prompts, models, and agent strategies | | Experimental |
| 31 | Ravevx/LLM-Spatial-Reasoning-Evaluation-2D-Physics-Puzzle | A benchmark environment for evaluating large language models’ spatial... | | Experimental |
| 32 | jameswniu/self-hosted-llm-evals-lab | Evaluation framework for self-hosted LLMs. Systematic prompt ablation... | | Experimental |
| 33 | Chunduri-Aditya/Model-Behavior-Lab | Local Ollama-based LLM evaluation platform that benchmarks reasoning,... | | Experimental |
| 34 | maxpetrusenko/llm-eval-notes | Public LLM evaluation artifacts: hallucination, brittleness, structured... | | Experimental |
| 35 | kogunlowo123/ai-evaluation-prompts | Prompt evaluation framework with accuracy, coherence, safety rubrics, and... | | Experimental |
| 36 | hirbis/prompt-governance | Replication package for "Prompt Governance in Financial AI" (Girolli, 2026).... | | Experimental |
| 37 | michaelflppv/prompt-llm-benchmark | Prompt LLM Bench is a platform that discovers compatible Hugging Face models... | | Experimental |
| 38 | tpertner/squeeze | Squeeze your model with pressure prompts to see if its behavior leaks. | | Experimental |
| 39 | reiidoda/OpenRe | Open-source AI agent evaluation workbench for benchmarking, tracing,... | | Experimental |
| 40 | vivek8849/llm-trust-evaluator | A production-ready framework for evaluating LLM reliability using semantic... | | Experimental |
| 41 | AspenXDev/job-evaluation-engine | Modular prompt-engineered system for deterministic job evaluation with... | | Experimental |
| 42 | kanupriya-GuptaM/llm-agreement-bias-benchmark | Benchmark framework for detecting agreement bias and answer instability in... | | Experimental |
| 43 | firechair/AI-Engineering-Critique | 🚀 An interactive platform for LLM Preference Learning and Comparative... | | Experimental |
| 44 | MVidicek/evalkit | Test your prompts like you test your code. Regression testing for LLM applications. | | Experimental |
| 45 | mohosy/OpenEvals | Open-source eval studio for prompt comparisons, regression tracking, and... | | Experimental |
| 46 | adityaarunsinghal/LLM-As-A-Judge-Prompt-Improver | Scientific framework for iterative LLM prompt improvement using... | | Experimental |
| 47 | aaddii09/llm-eval-harness | 🔍 Run efficient evaluations for prompt and LLM regression testing with this... | | Experimental |
| 48 | Philipnil06/ai-output-quality-lab | A structured experiment framework for prompt variation, evaluation, and... | | Experimental |
| 49 | MarcKarbowiak/ai-evaluation-harness | Production-minded evaluation harness for LLM features with structured... | | Experimental |
| 50 | aleremfer/prompt-eval-cases | Prompt comparison and evaluation across multiple LLMs (EN/ES) | | Experimental |
| 51 | ktjkc/reflextrust | 🧠 LLMs don’t just process text — they read the room. Meaning emerges through... | | Experimental |
| 52 | wzy6642/I3C-Select | Official implementation for "Instructing Large Language Models to Identify... | | Experimental |
| 53 | GTMVP/modal-llm-evaluator | Run 1,000 LLM evaluations in 10 minutes. Test prompts across Claude, GPT-4,... | | Experimental |
| 54 | kustonaut/llm-eval-kit | Quality scoring, eval suites, and regression detection for LLM outputs. | | Experimental |
| 55 | chirindaopensource/llm_faithfulness_hallucination_misalignment_detection | End-to-End Python implementation of Semantic Divergence Metrics (SDM) for... | | Experimental |
| 56 | noah-art3mis/crucible | Develop better LLM apps by testing different models and prompts in bulk. | | Experimental |
| 57 | rlin25/FrizzlesRubric | A modular system for automated, multi-metric AI prompt evaluation—featuring... | | Experimental |
| 58 | lkilefner/llm-quality-evaluation-examples | K–12 LLM evaluation examples using teacher-centered ground truths, rubrics,... | | Experimental |
| 59 | FlosMume/LLM-Safety-Labs-Starter | Foundation for building safer generative-AI systems — includes example... | | Experimental |
| 60 | aikenkyu001/semantic_roundtrip_benchmark_2 | This repository contains the primary contributions of our research paper, "A... | | Experimental |
| 61 | moses-shenassa/llm-prompt-framework-and-eval-suite | Prompt engineering framework + evaluation harness for LLM workflows... | | Experimental |
| 62 | vihanga/prompt-sandbox | Testing framework for LLM prompts. Started as a weekend project after... | | Experimental |
| 63 | antzedek/dar-quickfix | Runtime patch that kills LLM loops, drift & hallucinations in real-time –... | | Experimental |
| 64 | ghazaleh-mahmoodi/Prompting_LLMs_AS_Explainable_Metrics | Eval4NLP Shared Task on Prompting Large Language Models as Explainable Metrics | | Experimental |
| 65 | aikenkyu001/benchmarking_llm_against_prompt_formats | Official experimental environment for 'Benchmarking LLM Sensitivity to... | | Experimental |
| 66 | rahul-sg/HondaResearchLabs_DSC180A-Eval-Systems-Of-NextGen-LLMs | Domain-aware LLM summary evaluation and iterative refinement pipeline with... | | Experimental |
| 67 | ghazal001/LLM-C-Grading-Agent | Ongoing LLM-based grading agent for automated evaluation of C++ programming... | | Experimental |
| 68 | BOSSMAN-dev89/LBOS-LCAS-LP-Contradiction-tracker | A tool for auditing bias through large language models | | Experimental |
| 69 | Ziechoes/reasoning-invariance-benchmark | Experiments testing whether LLM reasoning trajectories remain invariant when... | | Experimental |
| 70 | AW-VB/llm-mcq-benchmark | Benchmarking open-weight LLMs on multiple-choice QA with prompt comparison,... | | Experimental |
| 71 | OptionalSoftware/concurrent | The Multi-LLM Benchmarking Tool | | Experimental |
| 72 | antsuebae/TFG-LLM-RE | TFG (final degree project): Comparative evaluation of local vs. cloud LLMs in ... Engineering | | Experimental |
| 73 | bensonbabu93/llm-prompt-evaluation-framework | A prompt experimentation tool that benchmarks LLM responses across multiple... | | Experimental |
| 74 | YifanHe0126/medical-mllm-evaluation | Evaluation and model selection workflow for open-source multimodal LLMs in... | | Experimental |
| 75 | illogical/LMEval | Web application for systematic prompt engineering and model evaluation | | Experimental |
| 76 | gamzeakkurt/Prompt-Evaluation-in-AWS-Bedrock | Prompt evaluation framework using AWS Bedrock to assess LLM outputs with... | | Experimental |
| 77 | jharter-stack/prompt-evals | prompt-evals — Prompt testing, comparisons, refinements, and failure cases | | Experimental |
| 78 | rahulthadhani/llm-benchmark | A benchmark suite that tests how zero-shot, few-shot, chain-of-thought, and... | | Experimental |
| 79 | rechriti/llm-risk-analysis | LLM-based risk analysis system using prompt engineering and evaluation (NDA-safe) | | Experimental |
| 80 | Codegrammer999/prompt-bench | This is a benchmark suite comparing zero-shot, few-shot, Chain-of-Thought,... | | Experimental |
| 81 | sergeyklay/factly | CLI tool to evaluate LLM factuality on MMLU benchmark. | | Experimental |
| 82 | motasemwed/llm-judge | LLM-as-a-Judge system for rubric-based, explainable evaluation of large... | | Experimental |
| 83 | TheSkyBiz/llm-persona-drift-evaluation | 945-generation adversarial evaluation of 3 open LLMs across 3 personas and... | | Experimental |
| 84 | eugeniusms/TextualVerifier | LLM-Based Textual Verifier using Chain-of-Thought, Variant Generation, and... | | Experimental |
| 85 | Laksh-star/ai-fluency-gym | Educational AI fluency self-assessment inspired by the 4D framework, with... | | Experimental |
| 86 | 1rajatk/content-judgment-calibrator | A judgment calibration framework for auditing content clarity, credibility,... | | Experimental |
| 87 | sportixIndia/LBOS-LCAS-LP-Contradiction-tracker | 🔍 Track contradictions in AI and human content with LBOS-LCAS, enhancing... | | Experimental |
| 88 | YaswanthGhanta/llm-logical-integrity-benchmark | Adversarial testing of LLMs on constraint satisfaction deadlocks | | Experimental |
| 89 | yuchenzhu-research/iclr2026-cao-prompt-drift-lab | A reproducible evaluation framework for studying how small prompt variations... | | Experimental |
| 90 | GoodCODER280722/llm-output-validator | Rule-based AI output validation CLI tool (mock mode) with structured JSON reporting. | | Experimental |
| 91 | mtchynkstff/llm-ed-eval | A reproducible evaluation framework analyzing how prompt strategies affect... | | Experimental |
| 92 | jadhav045/DeepStack-AILM-Assignment | A strict, provider-agnostic User Input Validator powered exclusively by LLMs... | | Experimental |
| 93 | SiemonCha/ECM3401-LLM-Essay-Scoring | Measuring semantic robustness in LLM-based CEFR essay scoring through... | | Experimental |
| 94 | KSVQ/openrouter-harness | Lightweight OpenRouter evaluation harness with web UI, batch runs, and a... | | Experimental |
| 95 | BrewLLM/brewval | Evaluate prompts for LLM applications | | Experimental |
| 96 | Wind-Flipped/DyG-Hallucination | Cocktail dynamic graph prompting technique in LLM for hallucination | | Experimental |
| 97 | jwalsh/amazon-bedrock-prompt-evaluation | Evaluating prompts at scale with Prompt Management and Prompt Flows for... | | Experimental |
| 98 | ajdedeaux/ai-eval-framework | Systematic AI evaluation framework that transforms subjective assessment... | | Experimental |
| 99 | meghajbhat/Reducing-Hallucinations-in-LLMs-using-Prompt-Engineering-Strategies | A comprehensive study on reducing hallucinations in Large Language Models... | | Experimental |
| 100 | gdorleon/balm_llm | BALM: Bias-Aware Language Model with inference-time bias detection and correction. | | Experimental |