chziakas/redeval

A library for red-teaming LLM applications with LLMs.

/ 100

Emerging

Before deploying a Large Language Model (LLM) in a real-world application, this tool helps you find its weaknesses and potential failure points. It takes your LLM application and automatically tests it against various simulated scenarios, producing detailed reports on how it performed. This is for anyone responsible for the safety, reliability, or performance of an LLM-powered product, such as an AI product manager, an ethics and safety specialist, or an operations engineer.

No commits in the last 6 months.

Use this if you need to thoroughly test an LLM application for vulnerabilities like manipulation, deception, or generating toxic content before it interacts with actual users.

Not ideal if you need to evaluate an LLM's raw academic benchmark performance or if your application doesn't involve conversational interactions.

AI-safety LLM-auditing product-testing risk-assessment conversational-AI

Stale 6m No Package No Dependents

Maintenance 0 / 25

Adoption 7 / 25

Maturity 16 / 25

Community 14 / 25

How are scores calculated?

Stars

Forks

Language

Python

License

Apache-2.0

Higher-rated alternatives

eth-sri/matharena

Evaluation of LLMs on latest math competitions

tatsu-lab/alpaca_eval

An automatic evaluator for instruction-following language models. Human-validated, high-quality,...

HPAI-BSC/TuRTLe

TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)

nlp-uoregon/mlmm-evaluation

Multilingual Large Language Models Evaluation Benchmark

haesleinhuepf/human-eval-bia

Benchmarking Large Language Models for Bio-Image Analysis Code Generation

Explore Transformer Models

All categories Trending Transformer directory Insights