LLM Evaluation Frameworks (Prompt Engineering Tools)
Systematic benchmarking and testing suites for evaluating LLM prompt strategies, output quality, consistency, and factuality across multiple models and tasks. Does NOT include prompt optimization tools, hallucination-reduction techniques alone, or general LLM deployment platforms.
100 LLM evaluation framework tools are tracked; 2 score above 50, placing them in the established tier. The highest-rated is microsoft/promptbench at 64/100, with 2,785 stars.
Get all 100 projects as JSON (the example below requests the top 20 via `limit=20`; raise `limit` to retrieve the full list):

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=prompt-engineering&subcategory=llm-evaluation-frameworks&limit=20"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
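For programmatic use, here is a minimal Python sketch of the same request. It assumes the endpoint returns a JSON body whose project records carry `name`, `score`, and `tier` fields; the field names and the `data` wrapper are guesses, so inspect a raw response before relying on them:

```python
import json
import urllib.parse
import urllib.request

# Endpoint and query parameters taken verbatim from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "prompt-engineering",
    "subcategory": "llm-evaluation-frameworks",
    "limit": 100,  # assumption: the API accepts limits above the example's 20
}

url = f"{BASE}?{urllib.parse.urlencode(params)}"
with urllib.request.urlopen(url, timeout=30) as resp:
    payload = json.load(resp)

# Assumption: records sit at the top level or under a "data" key.
records = payload if isinstance(payload, list) else payload.get("data", [])

# Print score, tier, and name, highest score first; missing fields fall back to "?".
for rec in sorted(records, key=lambda r: r.get("score") or 0, reverse=True):
    print(f"{rec.get('score', '?'):>5}  {rec.get('tier', '?'):<12}  {rec.get('name', '?')}")
```

At the anonymous tier of 100 requests/day, a single call with a high `limit` is the simplest way to pull the whole list.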
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | microsoft/promptbench | A unified evaluation framework for large language models | 64 | Established |
| 2 | uptrain-ai/uptrain | UpTrain is an open-source unified platform to evaluate and improve... | | Established |
| 3 | levitation-opensource/Manipulative-Expression-Recognition | MER is a software that identifies and highlights manipulative communication... | | Emerging |
| 4 | microsoftarchive/promptbench | A unified evaluation framework for large language models | | Emerging |
| 5 | gabe-mousa/Apolien | AI Safety Evaluation Library | | Emerging |
| 6 | GSA/FedRAMP-OllaLab-Lean | The OllaLab-Lean project is designed to help both novice and experienced... | | Emerging |
| 7 | PromptMixerDev/prompt-mixer-app-ce | A desktop application for comparing outputs from different Large Language... | | Emerging |
| 8 | babelcloud/LLM-RGB | LLM Reasoning and Generation Benchmark. Evaluate LLMs in complex scenarios... | | Emerging |
| 9 | ryoungj/ToolEmu | [ICLR'24 Spotlight] A language model (LM)-based emulation framework for... | | Emerging |
| 10 | ozturkoktay/insurance-llm-framework | An interactive framework for experimenting with and evaluating open-source... | | Emerging |
| 11 | syamsasi99/prompt-evaluator | prompt-evaluator is an open-source toolkit for evaluating, testing, and... | | Emerging |
| 12 | fau-masters-collected-works-cgarbin/llm-comparison-tool | A tool to compare multiple large language models (LLMs) side by side | | Emerging |
| 13 | realadeel/llm-test-bench | Compare LLM providers (OpenAI, Claude, Gemini) for vision tasks - benchmark... | | Emerging |
| 14 | pablo-chacon/Spoon-Bending | Educational analysis of LLM alignment, safety behavior, and... | | Emerging |
| 15 | deadbits/trs | 🔍 Threat report analysis via LLM and Vector DB | | Experimental |
| 16 | paradite/eval-data | Prompts and evaluation data for LLMs on real world coding and writing tasks | | Experimental |
| 17 | lpr021/redteam-ai-benchmark | 🧪 Evaluate uncensored LLMs for offensive security with targeted questions... | | Experimental |
| 18 | thuanystuart/DD3412-chain-of-verification-reproduction | Re-implementation of the paper "Chain-of-Verification Reduces Hallucination... | | Experimental |
| 19 | r4u-dev/open-r4u | Optimize AI & Maximize ROI of your LLM tasks. Evaluates current state and... | | Experimental |
| 20 | EviAmarates/fresta-edge | Domain evaluation lens generator built on the Fresta Lens Framework | | Experimental |
| 21 | LeNguyenAnhKhoa/Hallucination-Detection | Hallucination Detection using LLM's API | | Experimental |
| 22 | sidoody/heart-context-pack | Compiling the HEART Score into a structured, model-facing policy artifact... | | Experimental |
| 23 | SyntagmaNull/judgment-hygiene-stack | Tri-skill framework for structure routing, evidence discipline, and judgment... | | Experimental |
| 24 | GnomeMan4201/drift-artifact | Stylometric drift experiment — documents that demonstrate iterative... | | Experimental |
| 25 | petersimmons1972/brutal-evaluation | AI skill for brutally honest project feedback. Based on Dylan Davis's BRUTAL... | | Experimental |
| 26 | Amir-ElBelawy/llm-failure-mode-taxonomy | A practitioner's taxonomy of recurring failure patterns in large language... | | Experimental |
| 27 | IgnazioDS/evalops-workbench | A local-first evaluation harness for prompts, tools, and agents with... | | Experimental |
| 28 | kepiCHelaSHen/context-hacking | Turn LLM priors into scientific rigor. Zero-drift multi-agent framework for... | | Experimental |
| 29 | hideyuki001/unified-cognitive-os-v1.8 | Judgment decomposition architecture for translation QA, ASR review, AI... | | Experimental |
| 30 | joshualamerton/Modelbench | Concept: benchmarking harness for prompts, models, and agent strategies | | Experimental |
| 31 | Ravevx/LLM-Spatial-Reasoning-Evaluation-2D-Physics-Puzzle | A benchmark environment for evaluating large language models’ spatial... | | Experimental |
| 32 | jameswniu/self-hosted-llm-evals-lab | Evaluation framework for self-hosted LLMs. Systematic prompt ablation... | | Experimental |
| 33 | Chunduri-Aditya/Model-Behavior-Lab | Local Ollama-based LLM evaluation platform that benchmarks reasoning,... | | Experimental |
| 34 | maxpetrusenko/llm-eval-notes | Public LLM evaluation artifacts: hallucination, brittleness, structured... | | Experimental |
| 35 | kogunlowo123/ai-evaluation-prompts | Prompt evaluation framework with accuracy, coherence, safety rubrics, and... | | Experimental |
| 36 | hirbis/prompt-governance | Replication package for "Prompt Governance in Financial AI" (Girolli, 2026).... | | Experimental |
| 37 | michaelflppv/prompt-llm-benchmark | Prompt LLM Bench is a platform that discovers compatible Hugging Face models... | | Experimental |
| 38 | tpertner/squeeze | Squeeze your model with pressure prompts to see if its behavior leaks. | | Experimental |
| 39 | reiidoda/OpenRe | Open-source AI agent evaluation workbench for benchmarking, tracing,... | | Experimental |
| 40 | vivek8849/llm-trust-evaluator | A production-ready framework for evaluating LLM reliability using semantic... | | Experimental |
| 41 | AspenXDev/job-evaluation-engine | Modular prompt-engineered system for deterministic job evaluation with... | | Experimental |
| 42 | kanupriya-GuptaM/llm-agreement-bias-benchmark | Benchmark framework for detecting agreement bias and answer instability in... | | Experimental |
| 43 | firechair/AI-Engineering-Critique | 🚀 An interactive platform for LLM Preference Learning and Comparative... | | Experimental |
| 44 | MVidicek/evalkit | Test your prompts like you test your code. Regression testing for LLM applications. | | Experimental |
| 45 | mohosy/OpenEvals | Open-source eval studio for prompt comparisons, regression tracking, and... | | Experimental |
| 46 | adityaarunsinghal/LLM-As-A-Judge-Prompt-Improver | Scientific framework for iterative LLM prompt improvement using... | | Experimental |
| 47 | aaddii09/llm-eval-harness | 🔍 Run efficient evaluations for prompt and LLM regression testing with this... | | Experimental |
| 48 | Philipnil06/ai-output-quality-lab | A structured experiment framework for prompt variation, evaluation, and... | | Experimental |
| 49 | MarcKarbowiak/ai-evaluation-harness | Production-minded evaluation harness for LLM features with structured... | | Experimental |
| 50 | aleremfer/prompt-eval-cases | Prompt comparison and evaluation across multiple LLMs (EN/ES) | | Experimental |
| 51 | ktjkc/reflextrust | 🧠 LLMs don’t just process text — they read the room. Meaning emerges through... | | Experimental |
| 52 | wzy6642/I3C-Select | Official implementation for "Instructing Large Language Models to Identify... | | Experimental |
| 53 | GTMVP/modal-llm-evaluator | Run 1,000 LLM evaluations in 10 minutes. Test prompts across Claude, GPT-4,... | | Experimental |
| 54 | kustonaut/llm-eval-kit | Quality scoring, eval suites, and regression detection for LLM outputs. | | Experimental |
| 55 | chirindaopensource/llm_faithfulness_hallucination_misalignment_detection | End-to-End Python implementation of Semantic Divergence Metrics (SDM) for... | | Experimental |
| 56 | noah-art3mis/crucible | Develop better LLM apps by testing different models and prompts in bulk. | | Experimental |
| 57 | rlin25/FrizzlesRubric | A modular system for automated, multi-metric AI prompt evaluation—featuring... | | Experimental |
| 58 | lkilefner/llm-quality-evaluation-examples | K–12 LLM evaluation examples using teacher-centered ground truths, rubrics,... | | Experimental |
| 59 | FlosMume/LLM-Safety-Labs-Starter | Foundation for building safer generative-AI systems — includes example... | | Experimental |
| 60 | aikenkyu001/semantic_roundtrip_benchmark_2 | This repository contains the primary contributions of our research paper, "A... | | Experimental |
| 61 | moses-shenassa/llm-prompt-framework-and-eval-suite | Prompt engineering framework + evaluation harness for LLM workflows... | | Experimental |
| 62 | vihanga/prompt-sandbox | Testing framework for LLM prompts. Started as a weekend project after... | | Experimental |
| 63 | antzedek/dar-quickfix | Runtime patch that kills LLM loops, drift & hallucinations in real-time –... | | Experimental |
| 64 | ghazaleh-mahmoodi/Prompting_LLMs_AS_Explainable_Metrics | Eval4NLP Shared Task on Prompting Large Language Models as Explainable Metrics | | Experimental |
| 65 | aikenkyu001/benchmarking_llm_against_prompt_formats | Official experimental environment for 'Benchmarking LLM Sensitivity to... | | Experimental |
| 66 | rahul-sg/HondaResearchLabs_DSC180A-Eval-Systems-Of-NextGen-LLMs | Domain-aware LLM summary evaluation and iterative refinement pipeline with... | | Experimental |
| 67 | ghazal001/LLM-C-Grading-Agent | Ongoing LLM-based grading agent for automated evaluation of C++ programming... | | Experimental |
| 68 | BOSSMAN-dev89/LBOS-LCAS-LP-Contradiction-tracker | A tool for auditing bias through large language models | | Experimental |
| 69 | Ziechoes/reasoning-invariance-benchmark | Experiments testing whether LLM reasoning trajectories remain invariant when... | | Experimental |
| 70 | AW-VB/llm-mcq-benchmark | Benchmarking open-weight LLMs on multiple-choice QA with prompt comparison,... | | Experimental |
| 71 | OptionalSoftware/concurrent | The Multi-LLM Benchmarking Tool | | Experimental |
| 72 | antsuebae/TFG-LLM-RE | TFG (final degree project): Comparative evaluation of local vs. cloud LLMs in ... Engineering | | Experimental |
| 73 | bensonbabu93/llm-prompt-evaluation-framework | A prompt experimentation tool that benchmarks LLM responses across multiple... | | Experimental |
| 74 | YifanHe0126/medical-mllm-evaluation | Evaluation and model selection workflow for open-source multimodal LLMs in... | | Experimental |
| 75 | illogical/LMEval | Web application for systematic prompt engineering and model evaluation | | Experimental |
| 76 | gamzeakkurt/Prompt-Evaluation-in-AWS-Bedrock | Prompt evaluation framework using AWS Bedrock to assess LLM outputs with... | | Experimental |
| 77 | jharter-stack/prompt-evals | prompt-evals — Prompt testing, comparisons, refinements, and failure cases | | Experimental |
| 78 | rahulthadhani/llm-benchmark | A benchmark suite that tests how zero-shot, few-shot, chain-of-thought, and... | | Experimental |
| 79 | rechriti/llm-risk-analysis | LLM-based risk analysis system using prompt engineering and evaluation (NDA-safe) | | Experimental |
| 80 | Codegrammer999/prompt-bench | This is a benchmark suite comparing zero-shot, few-shot, Chain-of-Thought,... | | Experimental |
| 81 | sergeyklay/factly | CLI tool to evaluate LLM factuality on MMLU benchmark. | | Experimental |
| 82 | motasemwed/llm-judge | LLM-as-a-Judge system for rubric-based, explainable evaluation of large... | | Experimental |
| 83 | TheSkyBiz/llm-persona-drift-evaluation | 945-generation adversarial evaluation of 3 open LLMs across 3 personas and... | | Experimental |
| 84 | eugeniusms/TextualVerifier | LLM-Based Textual Verifier using Chain-of-Thought, Variant Generation, and... | | Experimental |
| 85 | Laksh-star/ai-fluency-gym | Educational AI fluency self-assessment inspired by the 4D framework, with... | | Experimental |
| 86 | 1rajatk/content-judgment-calibrator | A judgment calibration framework for auditing content clarity, credibility,... | | Experimental |
| 87 | sportixIndia/LBOS-LCAS-LP-Contradiction-tracker | 🔍 Track contradictions in AI and human content with LBOS-LCAS, enhancing... | | Experimental |
| 88 | YaswanthGhanta/llm-logical-integrity-benchmark | Adversarial testing of LLMs on constraint satisfaction deadlocks | | Experimental |
| 89 | yuchenzhu-research/iclr2026-cao-prompt-drift-lab | A reproducible evaluation framework for studying how small prompt variations... | | Experimental |
| 90 | GoodCODER280722/llm-output-validator | Rule-based AI output validation CLI tool (mock mode) with structured JSON reporting. | | Experimental |
| 91 | mtchynkstff/llm-ed-eval | A reproducible evaluation framework analyzing how prompt strategies affect... | | Experimental |
| 92 | jadhav045/DeepStack-AILM-Assignment | A strict, provider-agnostic User Input Validator powered exclusively by LLMs... | | Experimental |
| 93 | SiemonCha/ECM3401-LLM-Essay-Scoring | Measuring semantic robustness in LLM-based CEFR essay scoring through... | | Experimental |
| 94 | KSVQ/openrouter-harness | Lightweight OpenRouter evaluation harness with web UI, batch runs, and a... | | Experimental |
| 95 | BrewLLM/brewval | Evaluate prompts for LLM applications | | Experimental |
| 96 | Wind-Flipped/DyG-Hallucination | Cocktail dynamic graph prompting technique in LLM for hallucination | | Experimental |
| 97 | jwalsh/amazon-bedrock-prompt-evaluation | Evaluating prompts at scale with Prompt Management and Prompt Flows for... | | Experimental |
| 98 | ajdedeaux/ai-eval-framework | Systematic AI evaluation framework that transforms subjective assessment... | | Experimental |
| 99 | meghajbhat/Reducing-Hallucinations-in-LLMs-using-Prompt-Engineering-Strategies | A comprehensive study on reducing hallucinations in Large Language Models... | | Experimental |
| 100 | gdorleon/balm_llm | BALM: Bias-Aware Language Model with inference-time bias detection and correction. | | Experimental |