shreyansh26/Red-Teaming-Language-Models-with-Language-Models
A re-implementation of the "Red Teaming Language Models with Language Models" paper by Perez et al., 2022
This project helps AI safety researchers and model developers proactively identify and mitigate harmful outputs from large language models. Given a target language model, it automatically generates 'red-team' questions designed to elicit toxic or offensive responses. The output is a dataset of questions, the model's answers, and a toxicity score for each interaction, enabling evaluation of model safety and robustness.
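The pipeline described above follows three steps: a "red" language model proposes adversarial questions, the target model answers them, and a classifier scores each answer for toxicity. A minimal sketch of that loop, using stand-in stub functions rather than the project's actual models:

```python
# Hypothetical sketch of the red-teaming loop (Perez et al., 2022):
# generate questions, collect the target model's answers, score each
# answer for toxicity. All three functions are illustrative stubs.

def generate_questions(n):
    # Stand-in for a "red LM" that proposes adversarial test questions.
    return [f"Adversarial question #{i}" for i in range(n)]

def target_model_answer(question):
    # Stand-in for the target LM under evaluation.
    return "I would rather not respond to that."

def toxicity_score(text):
    # Stand-in for a toxicity classifier returning a score in [0, 1].
    return 0.01 if "rather not" in text else 0.9

# Build the dataset of (question, answer, toxicity) records.
dataset = []
for q in generate_questions(3):
    a = target_model_answer(q)
    dataset.append({"question": q, "answer": a, "toxicity": toxicity_score(a)})

# Flag interactions above a toxicity threshold for human review.
flagged = [row for row in dataset if row["toxicity"] > 0.5]
print(len(dataset), len(flagged))  # prints: 3 0
```

In the real project the stubs would be replaced by actual language models and a toxicity classifier; the flagged subset is what a safety reviewer inspects before deployment.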
No commits in the last 6 months.
Use this if you are a language model developer or an AI safety researcher needing to automatically test and evaluate your models for potential toxic language generation before deployment.
Not ideal if you need a comprehensive red-teaming solution that covers a broader range of risks beyond just toxic and offensive language.
Stars
35
Forks
3
Language
Python
License
—
Category
—
Last pushed
Oct 09, 2023
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/shreyansh26/Red-Teaming-Language-Models-with-Language-Models"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
zealscott/AutoProfiler
Source code for Automated Profile Inference with Language Model Agents
leondz/lm_risk_cards
Risks and targets for assessing LLMs & LLM vulnerabilities
RedTeamingforLLMs/RedTeamingforLLMs
A framework designed for executing positive red-teaming experiments on large language models.
dan0nchik/llm-attack-kit
A collection of LLM attacks