qcri/LLMeBench

Benchmarking Large Language Models

Score: 47 / 100 (Emerging)

This framework helps you objectively compare how well different large language models (LLMs) perform on specific language tasks, regardless of their provider (such as OpenAI or HuggingFace). You provide a dataset and a task (such as sentiment analysis or question answering), and it produces a detailed report on each model's accuracy and behavior. It's designed for AI researchers, data scientists, and language model evaluators who need to rigorously test and select the best LLM for their application.
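
To make that workflow concrete, below is a minimal, self-contained sketch of the kind of comparison loop such a framework automates: several models are run over the same labelled dataset and accuracy is reported per model. The toy dataset, the stand-in model functions, and the scoring are invented here purely for illustration; they are not LLMeBench's own API.

# Conceptual sketch only: compare several "models" on one labelled dataset.
# Nothing below is LLMeBench code; it just illustrates the comparison idea.

from typing import Callable, Dict, List, Tuple

# Tiny labelled sentiment dataset: (text, gold label).
DATASET: List[Tuple[str, str]] = [
    ("I loved this film", "positive"),
    ("Terrible service, never again", "negative"),
    ("It was fine, nothing special", "neutral"),
]

def keyword_model(text: str) -> str:
    """Stand-in for an LLM call; predicts a label from simple keywords."""
    lowered = text.lower()
    if "loved" in lowered:
        return "positive"
    if "terrible" in lowered or "never" in lowered:
        return "negative"
    return "neutral"

def always_neutral_model(text: str) -> str:
    """A second stand-in model, so there is something to compare against."""
    return "neutral"

def evaluate(models: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Run every model over the same dataset and return accuracy per model."""
    report: Dict[str, float] = {}
    for name, predict in models.items():
        correct = sum(1 for text, gold in DATASET if predict(text) == gold)
        report[name] = correct / len(DATASET)
    return report

if __name__ == "__main__":
    scores = evaluate({"keyword-baseline": keyword_model, "always-neutral": always_neutral_model})
    for name, accuracy in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {accuracy:.2%}")

A framework like this one replaces the hand-written loop above with pluggable dataset, task, and model components, so the same evaluation can be repeated across providers and reported consistently.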

105 stars. No commits in the last 6 months. Available on PyPI.

Use this if you need to systematically evaluate and benchmark multiple large language models on various natural language processing tasks using your own or existing datasets.

Not ideal if you're looking for a simple tool to just apply one LLM to a task without needing to compare its performance against others.

Tags: LLM evaluation · NLP benchmarking · AI model comparison · language model testing · computational linguistics
Badges: No License · Stale (6 months)
Maintenance 2 / 25
Adoption 9 / 25
Maturity 17 / 25
Community 19 / 25


Stars: 105
Forks: 21
Language: Python
License: None
Last pushed: Jun 20, 2025
Commits (30d): 0
Dependencies: 15

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/qcri/LLMeBench"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
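
The same request can be made from Python. This is a minimal sketch: only the endpoint URL comes from this page, and the response schema is not documented here, so the payload is simply pretty-printed as returned.

import json
import requests  # third-party: pip install requests

URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/qcri/LLMeBench"

def fetch_quality_report() -> dict:
    """GET the public endpoint (no key needed, 100 requests/day) and parse the JSON body."""
    response = requests.get(URL, timeout=10)
    response.raise_for_status()  # surface 4xx/5xx errors instead of parsing an error page
    return response.json()

if __name__ == "__main__":
    print(json.dumps(fetch_quality_report(), indent=2))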