chziakas/redeval
A library for red-teaming LLM applications with LLMs.
Before deploying a Large Language Model (LLM) in a real-world application, this tool helps you find its weaknesses and potential failure points. It takes your LLM application and automatically tests it against various simulated scenarios, producing detailed reports on how it performed. This is for anyone responsible for the safety, reliability, or performance of an LLM-powered product, such as an AI product manager, an ethics and safety specialist, or an operations engineer.
No commits in the last 6 months.
Use this if you need to thoroughly test an LLM application for vulnerabilities like manipulation, deception, or generating toxic content before it interacts with actual users.
Not ideal if you need to evaluate an LLM's raw academic benchmark performance or if your application doesn't involve conversational interactions.
Stars
29
Forks
5
Language
Python
License
Apache-2.0
Category
Last pushed
Oct 11, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/chziakas/redeval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation