qcri/LLMeBench
Benchmarking Large Language Models
This framework helps you objectively compare how well different large language models (LLMs) perform on specific language tasks, regardless of their source (like OpenAI or HuggingFace). You provide a dataset and a task (such as sentiment analysis or question answering), and it outputs a detailed report on each model's accuracy and behavior. It's designed for AI researchers, data scientists, and language model evaluators who need to rigorously test and select the best LLM for their application.
105 stars. No commits in the last 6 months. Available on PyPI.
Use this if you need to systematically evaluate and benchmark multiple large language models on various natural language processing tasks using your own or existing datasets.
Not ideal if you're looking for a simple tool to just apply one LLM to a task without needing to compare its performance against others.
Stars: 105
Forks: 21
Language: Python
License: —
Category: —
Last pushed: Jun 20, 2025
Commits (30d): 0
Dependencies: 15
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/qcri/LLMeBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
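The same endpoint can be called from Python instead of curl. The sketch below is an assumption-laden illustration: it only builds the documented URL and wraps a plain GET with the standard library; the response fields and the `fetch_quality` helper name are hypothetical, not part of any published client.

```python
import json
import urllib.request

# Base of the pt-edge quality API, taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem: str, repo: str) -> str:
    """Build the API URL for a repo, e.g. ('transformers', 'qcri/LLMeBench')."""
    return f"{BASE}/{ecosystem}/{repo}"

def fetch_quality(ecosystem: str, repo: str, timeout: float = 10.0) -> dict:
    """GET the quality record as JSON; no key needed up to 100 requests/day.

    Hypothetical helper: the response schema is not documented here, so the
    returned dict's keys are whatever the service emits.
    """
    with urllib.request.urlopen(quality_url(ecosystem, repo), timeout=timeout) as resp:
        return json.load(resp)

url = quality_url("transformers", "qcri/LLMeBench")
print(url)
```

For authenticated use (1,000 requests/day), you would presumably attach the key per the service's docs; how the key is passed (header vs. query parameter) is not specified here.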
Higher-rated alternatives
- stanfordnlp/axbench: Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
- aidatatools/ollama-benchmark: LLM benchmark for throughput via Ollama (local LLMs)
- LarHope/ollama-benchmark: Ollama-based benchmark reporting detailed I/O tokens per second; Python, with a DeepSeek R1 example
- THUDM/LongBench: LongBench v2 and LongBench (ACL '24 and '25)
- microsoft/LLF-Bench: A benchmark for evaluating learning agents based on just language feedback