pyladiesams/eval-llm-based-apps-jan2025
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.
This project helps developers build reliable LLM-based applications by providing a framework for continuous evaluation: it takes your LLM application's code and test data and produces performance metrics along with insights for improvement. It is aimed at developers, ML engineers, and data scientists working on production AI systems; a minimal sketch of the pattern follows the usage notes below.
No commits in the last 6 months.
Use this if you are developing an LLM-based application and need a robust way to test, evaluate, and monitor its performance throughout its lifecycle.
Not ideal if you are looking for a pre-built, ready-to-deploy LLM application or if you lack basic Python and ML testing knowledge.
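The snippet below is only a rough illustration of that pattern, not this repository's actual API: a single evaluation case wired into a pytest suite, with a hypothetical generate_answer entry point and an illustrative keyword-recall score standing in for whatever metric your application needs.

# Hedged sketch: wiring a simple LLM evaluation case into a pytest suite.
# generate_answer is a hypothetical stand-in for your application's entry point;
# the scoring rule (keyword recall against a reference) is illustrative only.
import pytest

EVAL_CASES = [
    {
        "question": "What does HTTP status 404 mean?",
        "reference_keywords": ["not found", "resource"],
    },
]

def generate_answer(question: str) -> str:
    # Stand-in for the real application call; replace with your LLM app's entry point.
    return "404 means the requested resource was not found on the server."

def keyword_recall(answer: str, keywords: list[str]) -> float:
    # Fraction of expected keywords that appear in the answer (case-insensitive).
    answer_lower = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in answer_lower)
    return hits / len(keywords)

@pytest.mark.parametrize("case", EVAL_CASES)
def test_answer_covers_reference_keywords(case):
    answer = generate_answer(case["question"])
    assert keyword_recall(answer, case["reference_keywords"]) >= 0.5

Running this through pytest on every change gives a first, crude regression signal; the same cases and scores can later feed a monitoring dashboard.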
Stars: 8
Forks: 6
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: May 06, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/pyladiesams/eval-llm-based-apps-jan2025"
Open to everyone: 100 requests/day with no key required. A free API key raises the limit to 1,000 requests/day.
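The same data can be fetched from Python. The short sketch below assumes the endpoint returns JSON (its exact response fields are not documented here) and uses the requests library.

import requests

# Quality-data endpoint for this repository (see the curl example above).
URL = (
    "https://pt-edge.onrender.com/api/v1/quality/transformers/"
    "pyladiesams/eval-llm-based-apps-jan2025"
)

# Unauthenticated access is limited to 100 requests/day.
response = requests.get(URL, timeout=10)
response.raise_for_status()
data = response.json()  # assumed to be JSON; field names are not documented here
print(data)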
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation