shreyansh26/Red-Teaming-Language-Models-with-Language-Models
A re-implementation of the "Red Teaming Language Models with Language Models" paper by Perez et al., 2022
This project helps AI safety researchers and model developers proactively identify and mitigate harmful outputs from large language models. Given a target language model, it automatically generates 'red-team' questions designed to elicit toxic or offensive responses. The output is a dataset of questions, the model's answers, and a toxicity score for each interaction, enabling evaluation of model safety and robustness.
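The pipeline described above follows three steps: a "red" language model proposes adversarial questions, the target model answers them, and a classifier scores each answer for toxicity. A minimal sketch of that loop, using stand-in stub functions rather than the project's actual models:

```python
# Hypothetical sketch of the red-teaming loop (Perez et al., 2022):
# generate questions, collect the target model's answers, score each
# answer for toxicity. All three functions are illustrative stubs.

def generate_questions(n):
    # Stand-in for a "red LM" that proposes adversarial test questions.
    return [f"Adversarial question #{i}" for i in range(n)]

def target_model_answer(question):
    # Stand-in for the target LM under evaluation.
    return "I would rather not respond to that."

def toxicity_score(text):
    # Stand-in for a toxicity classifier returning a score in [0, 1].
    return 0.01 if "rather not" in text else 0.9

# Build the dataset of (question, answer, toxicity) records.
dataset = []
for q in generate_questions(3):
    a = target_model_answer(q)
    dataset.append({"question": q, "answer": a, "toxicity": toxicity_score(a)})

# Flag interactions above a toxicity threshold for human review.
flagged = [row for row in dataset if row["toxicity"] > 0.5]
print(len(dataset), len(flagged))  # prints: 3 0
```

In the real project the stubs would be replaced by actual language models and a toxicity classifier; the flagged subset is what a safety reviewer inspects before deployment.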
No commits in the last 6 months.
Use this if you are a language model developer or an AI safety researcher needing to automatically test and evaluate your models for potential toxic language generation before deployment.
Not ideal if you need a comprehensive red-teaming solution that covers a broader range of risks beyond just toxic and offensive language.
Stars
35
Forks
3
Language
Python
License
—
Category
—
Last pushed
Oct 09, 2023
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/shreyansh26/Red-Teaming-Language-Models-with-Language-Models"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
zealscott/AutoProfiler
Source code for Automated Profile Inference with Language Model Agents
leondz/lm_risk_cards
Risks and targets for assessing LLMs & LLM vulnerabilities
RedTeamingforLLMs/RedTeamingforLLMs
A framework designed for executing positive red-teaming experiments on large language models.
dan0nchik/llm-attack-kit
A collection of LLM attacks