microsoft/promptbench

A unified evaluation framework for large language models

Score: 64 / 100 (Established)

This tool helps researchers and practitioners systematically evaluate how well different Large Language Models (LLMs) perform across tasks. You supply the LLMs to test, datasets (such as MMLU, SQuAD, or multi-modal datasets like VQAv2), and the prompting techniques to apply, and it outputs performance metrics along with insights into model robustness. It's designed for anyone who needs to benchmark LLMs for specific applications or research purposes.

2,785 stars. Available on PyPI.
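As a rough illustration of the workflow described above, here is a minimal sketch modeled on the project's quick-start pattern: load a dataset and a model, then score several prompt formulations against the same labels. The class names (pb.DatasetLoader, pb.LLMModel, pb.Prompt, pb.InputProcess, pb.OutputProcess, pb.Eval), the model and dataset identifiers, and the keyword arguments are assumptions that may differ between releases; check the repository's documentation before relying on them.

import promptbench as pb  # pip install promptbench

# Load a benchmark dataset and a model to evaluate (identifiers are assumptions;
# the library lists valid names, e.g. via pb.SUPPORTED_DATASETS / pb.SUPPORTED_MODELS).
dataset = pb.DatasetLoader.load_dataset("sst2")
model = pb.LLMModel(model="gpt-3.5-turbo", max_new_tokens=10, temperature=0.0001)

# Compare two prompt formulations of the same classification task.
prompts = pb.Prompt([
    "Classify the sentence as positive or negative: {content}",
    "Determine the sentiment of the following sentence as positive or negative: {content}",
])

for prompt in prompts:
    preds, labels = [], []
    for sample in dataset:
        input_text = pb.InputProcess.basic_format(prompt, sample)  # fill the {content} slot
        raw_pred = model(input_text)
        preds.append(pb.OutputProcess.cls(raw_pred, model.model_name))  # map raw text to a label
        labels.append(sample["label"])
    # Per-prompt accuracy shows how sensitive the model is to prompt wording.
    print(f"{pb.Eval.compute_cls_accuracy(preds, labels):.3f}  {prompt}")

Reporting accuracy per prompt is what enables the prompt-sensitivity and robustness comparisons described above.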

Use this if you need a standardized way to compare the performance, robustness, and prompt-engineering effectiveness of various Large Language Models across diverse datasets, including support for advanced techniques such as adversarial prompting and dynamic evaluation.

Not ideal if you are an end-user simply looking to apply an LLM to a task without needing to conduct deep comparative analysis or develop new evaluation methodologies.

Tags: LLM-evaluation, prompt-engineering, AI-benchmarking, natural-language-processing, multi-modal-AI
Maintenance 10 / 25
Adoption 10 / 25
Maturity 25 / 25
Community 19 / 25

Stars: 2,785
Forks: 219
Language: Python
License: MIT
Last pushed: Feb 20, 2026
Commits (30d): 0
Dependencies: 18

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/prompt-engineering/microsoft/promptbench"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.
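If you prefer Python over curl, a minimal sketch using the requests library is below; it calls the same endpoint shown above, and the response is assumed to be JSON whose exact fields are not documented here.

import requests

# Same endpoint as the curl example; no API key is needed at the free tier.
url = "https://pt-edge.onrender.com/api/v1/quality/prompt-engineering/microsoft/promptbench"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
print(resp.json())  # inspect the payload to see which fields are returned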