qcri/LLMeBench
Benchmarking Large Language Models
This framework helps you objectively compare how well different large language models (LLMs) perform on specific language tasks, regardless of their source (like OpenAI or HuggingFace). You provide a dataset and a task (such as sentiment analysis or question answering), and it outputs a detailed report on each model's accuracy and behavior. It's designed for AI researchers, data scientists, and language model evaluators who need to rigorously test and select the best LLM for their application.
105 stars. No commits in the last 6 months. Available on PyPI.
Use this if you need to systematically evaluate and benchmark multiple large language models on various natural language processing tasks using your own or existing datasets.
Not ideal if you're looking for a simple tool to just apply one LLM to a task without needing to compare its performance against others.
Stars: 105
Forks: 21
Language: Python
License: —
Category: —
Last pushed: Jun 20, 2025
Commits (30d): 0
Dependencies: 15
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/qcri/LLMeBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
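The same endpoint can be called from Python instead of curl. The sketch below is an assumption-laden illustration: it only builds the documented URL and wraps a plain GET with the standard library; the response fields and the `fetch_quality` helper name are hypothetical, not part of any published client.

```python
import json
import urllib.request

# Base of the pt-edge quality API, taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem: str, repo: str) -> str:
    """Build the API URL for a repo, e.g. ('transformers', 'qcri/LLMeBench')."""
    return f"{BASE}/{ecosystem}/{repo}"

def fetch_quality(ecosystem: str, repo: str, timeout: float = 10.0) -> dict:
    """GET the quality record as JSON; no key needed up to 100 requests/day.

    Hypothetical helper: the response schema is not documented here, so the
    returned dict's keys are whatever the service emits.
    """
    with urllib.request.urlopen(quality_url(ecosystem, repo), timeout=timeout) as resp:
        return json.load(resp)

url = quality_url("transformers", "qcri/LLMeBench")
print(url)
```

For authenticated use (1,000 requests/day), you would presumably attach the key per the service's docs; how the key is passed (header vs. query parameter) is not specified here.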
Higher-rated alternatives
- stanfordnlp/axbench: Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
- aidatatools/ollama-benchmark: LLM benchmark for throughput via Ollama (local LLMs)
- LarHope/ollama-benchmark: Ollama-based benchmark reporting detailed I/O tokens per second; Python, with a DeepSeek R1 example
- THUDM/LongBench: LongBench v2 and LongBench (ACL '24 and '25)
- microsoft/LLF-Bench: A benchmark for evaluating learning agents based on just language feedback