shreyansh26/Red-Teaming-Language-Models-with-Language-Models

A re-implementation of the "Red Teaming Language Models with Language Models" paper by Perez et al., 2022

23
/ 100
Experimental

This project helps AI safety researchers and model developers proactively identify and mitigate harmful outputs from large language models. It takes various large language models and automatically generates potential 'red-team' questions designed to elicit toxic or offensive responses. The output consists of a dataset of questions, the model's answers, and a toxicity score for each interaction, allowing for evaluation of model safety and robustness.

No commits in the last 6 months.

Use this if you are a language model developer or an AI safety researcher needing to automatically test and evaluate your models for potential toxic language generation before deployment.

Not ideal if you need a comprehensive red-teaming solution that covers a broader range of risks beyond just toxic and offensive language.

AI Safety Language Model Evaluation Harmful Content Detection Responsible AI Model Risk Assessment
No License Stale 6m No Package No Dependents
Maintenance 0 / 25
Adoption 7 / 25
Maturity 8 / 25
Community 8 / 25

How are scores calculated?

Stars

35

Forks

3

Language

Python

License

Last pushed

Oct 09, 2023

Commits (30d)

0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/shreyansh26/Red-Teaming-Language-Models-with-Language-Models"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.