microsoft/promptbench
A unified evaluation framework for large language models
This tool helps researchers and practitioners systematically evaluate how well different Large Language Models (LLMs) perform on various tasks. You supply the LLMs to test, datasets (such as MMLU, SQuAD, or multi-modal datasets like VQAv2), and the prompting techniques to apply, and it outputs performance metrics along with insights into model robustness. It's designed for anyone who needs to benchmark LLMs for specific applications or research purposes.
2,785 stars. Available on PyPI.
Use this if you need a standardized way to compare the performance, robustness, and prompting effectiveness of various Large Language Models across diverse datasets, with support for advanced techniques such as adversarial prompting and dynamic evaluation.
Not ideal if you are an end-user simply looking to apply an LLM to a task without needing to conduct deep comparative analysis or develop new evaluation methodologies.
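To make the workflow concrete, here is a minimal sketch of a sentiment-classification benchmark following the quick-start pattern in the project's README. The class and method names used (DatasetLoader, LLMModel, Prompt, InputProcess, OutputProcess, Eval) reflect that documented pattern but may differ across versions, and the API key value is a placeholder, so check the current documentation before relying on them.

import promptbench as pb

# Load a benchmark dataset (SST-2 sentiment classification).
dataset = pb.DatasetLoader.load_dataset("sst2")

# Wrap the model to evaluate; the api_key value is a placeholder.
model = pb.LLMModel(model="gpt-3.5-turbo", api_key="YOUR_OPENAI_KEY", max_new_tokens=10)

# Two candidate prompt templates to compare.
prompts = pb.Prompt([
    "Classify the sentence as positive or negative: {content}",
    "As a sentiment classifier, label the following sentence positive or negative: {content}",
])

def proj_func(pred: str) -> int:
    # Map the model's text answer onto SST-2's integer labels.
    return {"positive": 1, "negative": 0}.get(pred, -1)

for prompt in prompts:
    preds, labels = [], []
    for data in dataset:
        # Fill the prompt template with the example's fields.
        input_text = pb.InputProcess.basic_format(prompt, data)
        raw_pred = model(input_text)
        # Normalize the raw completion into a class label.
        preds.append(pb.OutputProcess.cls(raw_pred, proj_func))
        labels.append(data["label"])
    # Accuracy per prompt, so the prompting strategies can be compared directly.
    score = pb.Eval.compute_cls_accuracy(preds, labels)
    print(f"{score:.3f}  {prompt}")

Running this prints one accuracy score per prompt template, which is the core comparison the framework is built around.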
Stars
2,785
Forks
219
Language
Python
License
MIT
Category
Prompt Engineering
Last pushed
Feb 20, 2026
Commits (30d)
0
Dependencies
18
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/prompt-engineering/microsoft/promptbench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
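For programmatic access, a minimal Python sketch of the same request is below. The response schema is not documented on this page, so the sketch simply prints whatever JSON the endpoint returns; field names would need to be confirmed against the API documentation, and the API key header for the higher rate limit is omitted because its name is not given here.

import json
import urllib.request

# Public endpoint from the curl example above; no key is needed
# for up to 100 requests per day.
URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "prompt-engineering/microsoft/promptbench")

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

# Schema is undocumented here, so just pretty-print the payload.
print(json.dumps(data, indent=2))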
Related tools
uptrain-ai/uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications....
levitation-opensource/Manipulative-Expression-Recognition
MER is software that identifies and highlights manipulative communication in text from human...
microsoftarchive/promptbench
A unified evaluation framework for large language models
gabe-mousa/Apolien
AI Safety Evaluation Library
GSA/FedRAMP-OllaLab-Lean
The OllaLab-Lean project is designed to help both novice and experienced developers rapidly set...