microsoft/promptbench

A unified evaluation framework for large language models

Score: 64 / 100 (Established)

This tool helps researchers and practitioners systematically evaluate how well different Large Language Models (LLMs) perform across tasks. You supply the LLMs to test, datasets (such as MMLU, SQuAD, or multi-modal datasets like VQAv2), and the prompting techniques to apply, and it outputs performance metrics along with insights into model robustness. It's designed for anyone who needs to benchmark LLMs for specific applications or research purposes.

2,785 stars. Available on PyPI.
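As a rough illustration of the workflow described above, here is a minimal sketch modeled on the project's quick-start pattern: load a dataset and a model, then score several prompt formulations against the same labels. The class names (pb.DatasetLoader, pb.LLMModel, pb.Prompt, pb.InputProcess, pb.OutputProcess, pb.Eval), the model and dataset identifiers, and the keyword arguments are assumptions that may differ between releases; check the repository's documentation before relying on them.

import promptbench as pb  # pip install promptbench

# Load a benchmark dataset and a model to evaluate (identifiers are assumptions;
# the library lists valid names, e.g. via pb.SUPPORTED_DATASETS / pb.SUPPORTED_MODELS).
dataset = pb.DatasetLoader.load_dataset("sst2")
model = pb.LLMModel(model="gpt-3.5-turbo", max_new_tokens=10, temperature=0.0001)

# Compare two prompt formulations of the same classification task.
prompts = pb.Prompt([
    "Classify the sentence as positive or negative: {content}",
    "Determine the sentiment of the following sentence as positive or negative: {content}",
])

for prompt in prompts:
    preds, labels = [], []
    for sample in dataset:
        input_text = pb.InputProcess.basic_format(prompt, sample)  # fill the {content} slot
        raw_pred = model(input_text)
        preds.append(pb.OutputProcess.cls(raw_pred, model.model_name))  # map raw text to a label
        labels.append(sample["label"])
    # Per-prompt accuracy shows how sensitive the model is to prompt wording.
    print(f"{pb.Eval.compute_cls_accuracy(preds, labels):.3f}  {prompt}")

Reporting accuracy per prompt is what enables the prompt-sensitivity and robustness comparisons described above.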

Use this if you need a standardized way to compare the performance, robustness, and prompt-engineering effectiveness of various Large Language Models across diverse datasets, including support for advanced techniques such as adversarial prompting and dynamic evaluation.

Not ideal if you are an end-user simply looking to apply an LLM to a task without needing to conduct deep comparative analysis or develop new evaluation methodologies.

Tags: LLM-evaluation, prompt-engineering, AI-benchmarking, natural-language-processing, multi-modal-AI
Maintenance 10 / 25
Adoption 10 / 25
Maturity 25 / 25
Community 19 / 25

Stars: 2,785
Forks: 219
Language: Python
License: MIT
Last pushed: Feb 20, 2026
Commits (30d): 0
Dependencies: 18

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/prompt-engineering/microsoft/promptbench"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.
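If you prefer Python over curl, a minimal sketch using the requests library is below; it calls the same endpoint shown above, and the response is assumed to be JSON whose exact fields are not documented here.

import requests

# Same endpoint as the curl example; no API key is needed at the free tier.
url = "https://pt-edge.onrender.com/api/v1/quality/prompt-engineering/microsoft/promptbench"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
print(resp.json())  # inspect the payload to see which fields are returned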