grigio/llm-eval-simple
llm-eval-simple is a simple LLM evaluation framework with intermediate actions and prompt pattern selection
This tool helps AI engineers and machine learning practitioners test different Large Language Models (LLMs) against a set of prompts with known expected answers. You supply the prompts and their correct responses, and the tool measures how accurately and how quickly each model answers, producing a detailed report and an interactive dashboard for comparing results. It's aimed at anyone who wants to benchmark LLMs and pick the best one for a specific task.
Use this if you need to systematically compare the performance (accuracy and speed) of multiple LLMs on your custom datasets and understand which models are best suited for your applications.
Not ideal if you need to evaluate the qualitative aspects of LLM outputs (like creativity or fluency) that can't be judged by exact matching or a simple AI evaluator model.
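As a rough illustration of the prompt/expected-answer idea and of exact-match scoring: the actual file format llm-eval-simple expects is not shown on this page, so the field names and helper below are assumptions, not the tool's real API.

# Hypothetical prompt/expected-answer pairs; llm-eval-simple's real format may differ.
dataset = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "2 + 2 = ?", "expected": "4"},
]

def exact_match_score(model_answers, dataset):
    # Fraction of answers that match the expected text exactly (case-insensitive).
    hits = sum(
        ans.strip().lower() == item["expected"].strip().lower()
        for ans, item in zip(model_answers, dataset)
    )
    return hits / len(dataset)

# Example: a model that gets only the first question right scores 0.5.
print(exact_match_score(["Paris", "5"], dataset))

Scoring like this only works when answers can be compared literally, which is why the tool is a poor fit for judging creativity or fluency.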
Stars: 59
Forks: 1
Language: Python
License: MIT
Category:
Last pushed: Feb 28, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/grigio/llm-eval-simple"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
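If you prefer scripting the lookup instead of using curl, the same public endpoint can be fetched with a few lines of Python. This is a minimal sketch using only the unauthenticated URL shown above; the shape of the returned JSON is not documented here, so the script simply pretty-prints whatever comes back.

import json
import urllib.request

# Public quality-data endpoint shown above (100 requests/day without a key).
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/grigio/llm-eval-simple"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

# The response schema isn't documented here, so just pretty-print it.
print(json.dumps(data, indent=2))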
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation