pyladiesams/eval-llm-based-apps-jan2025
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.
This project helps developers build reliable LLM-based applications by providing a framework for continuous evaluation: it takes your LLM application's code and test data and produces performance metrics along with insights for improvement. It is aimed at developers, ML engineers, and data scientists working on production AI systems; a minimal sketch of the pattern follows the usage notes below.
No commits in the last 6 months.
Use this if you are developing an LLM-based application and need a robust way to test, evaluate, and monitor its performance throughout its lifecycle.
Not ideal if you are looking for a pre-built, ready-to-deploy LLM application or if you lack basic Python and ML testing knowledge.
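The snippet below is only a rough illustration of that pattern, not this repository's actual API: a single evaluation case wired into a pytest suite, with a hypothetical generate_answer entry point and an illustrative keyword-recall score standing in for whatever metric your application needs.

# Hedged sketch: wiring a simple LLM evaluation case into a pytest suite.
# generate_answer is a hypothetical stand-in for your application's entry point;
# the scoring rule (keyword recall against a reference) is illustrative only.
import pytest

EVAL_CASES = [
    {
        "question": "What does HTTP status 404 mean?",
        "reference_keywords": ["not found", "resource"],
    },
]

def generate_answer(question: str) -> str:
    # Stand-in for the real application call; replace with your LLM app's entry point.
    return "404 means the requested resource was not found on the server."

def keyword_recall(answer: str, keywords: list[str]) -> float:
    # Fraction of expected keywords that appear in the answer (case-insensitive).
    answer_lower = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in answer_lower)
    return hits / len(keywords)

@pytest.mark.parametrize("case", EVAL_CASES)
def test_answer_covers_reference_keywords(case):
    answer = generate_answer(case["question"])
    assert keyword_recall(answer, case["reference_keywords"]) >= 0.5

Running this through pytest on every change gives a first, crude regression signal; the same cases and scores can later feed a monitoring dashboard.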
Stars: 8
Forks: 6
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: May 06, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/pyladiesams/eval-llm-based-apps-jan2025"
Open to everyone: 100 requests/day with no key required. A free API key raises the limit to 1,000 requests/day.
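The same data can be fetched from Python. The short sketch below assumes the endpoint returns JSON (its exact response fields are not documented here) and uses the requests library.

import requests

# Quality-data endpoint for this repository (see the curl example above).
URL = (
    "https://pt-edge.onrender.com/api/v1/quality/transformers/"
    "pyladiesams/eval-llm-based-apps-jan2025"
)

# Unauthenticated access is limited to 100 requests/day.
response = requests.get(URL, timeout=10)
response.raise_for_status()
data = response.json()  # assumed to be JSON; field names are not documented here
print(data)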
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation