Praveengovianalytics/falcon-evaluate
Falcon Evaluate is an open-source Python library that aims to revolutionise the LLM and RAG evaluation process by offering a low-code solution. Our goal is to make the evaluation process as seamless and efficient as possible, allowing you to focus on what truly matters. This library aims to provide an easy-to-use toolkit for assessing the performance, bias, and general behavior of LLMs.
When evaluating multiple large language models (LLMs) or retrieval-augmented generation (RAG) systems, this tool helps you compare their responses to a set of prompts and reference answers. It takes a table containing your prompts, correct answers, and each model's generated text, then outputs a detailed performance breakdown, including readability, toxicity, and similarity scores. This is ideal for AI product managers, data scientists, or researchers who need to quantify and understand the quality of their LLMs.
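As a rough sketch of the input format described above, the evaluation table can be built with pandas. The column names and layout here are illustrative assumptions for demonstration, not the library's documented schema.

```python
import pandas as pd

# Illustrative input table: one row per prompt, with the reference answer
# and each candidate model's generated response in its own column.
# Column names are assumptions for demonstration, not a documented schema.
data = {
    "prompt": [
        "What is the capital of France?",
        "Summarize the water cycle in one sentence.",
    ],
    "reference": [
        "The capital of France is Paris.",
        "Water evaporates, condenses into clouds, and returns as precipitation.",
    ],
    "model_a_response": [
        "Paris is the capital of France.",
        "Water evaporates, forms clouds, and falls back to earth as rain.",
    ],
    "model_b_response": [
        "France's capital city is Paris.",
        "The sun heats water, it rises, cools into clouds, and rains back down.",
    ],
}
df = pd.DataFrame(data)

# The evaluator would score each model column against the reference,
# producing per-response readability, toxicity, and similarity metrics.
print(df)
```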
No commits in the last 6 months.
Use this if you need an easy way to compare the performance, bias, and general behavior of different LLMs or RAG systems using various metrics.
Not ideal if you are only evaluating a single model and don't require comparative analysis against multiple alternatives.
Stars: 14
Forks: 4
Language: Python
License: MIT
Category:
Last pushed: Jan 31, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Praveengovianalytics/falcon-evaluate"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
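For reference, the same request can be made from Python with the requests library. This is a minimal sketch that assumes the endpoint returns JSON; the response fields are not documented here.

```python
import requests

# Same request as the curl command above, issued from Python.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/Praveengovianalytics/falcon-evaluate"
resp = requests.get(url, timeout=10)
resp.raise_for_status()

# The payload structure is not documented here, so just print it as-is.
print(resp.json())
```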
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation