declare-lab/red-instruct

Code and datasets for the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment".

Overall score: 39 / 100 (Emerging)

This project evaluates how safely large language models (LLMs) respond to harmful questions, using different prompt styles to probe their safety guardrails. You provide a set of potentially harmful questions and specific prompt templates, and it generates model responses. The result is an Attack Success Rate (ASR): the fraction of harmful prompts that elicit an unsafe answer, which indicates how easily an LLM's guardrails can be bypassed. This tool is for AI safety researchers and developers who need to rigorously test and improve the safety of their LLMs.
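
As a rough illustration of the metric only (not the repository's own evaluation code, which judges responses separately), ASR can be computed as the share of harmful prompts whose responses were labeled unsafe. The function name and label values below are hypothetical:

# Illustrative sketch: assumes responses were already labeled "unsafe" or
# "safe" by some judge (human or model); not this repository's actual code.
def attack_success_rate(labels):
    """Return the fraction of harmful prompts that elicited an unsafe response."""
    if not labels:
        return 0.0
    return sum(label == "unsafe" for label in labels) / len(labels)

# Example: 3 of 5 red-teaming prompts produced unsafe answers -> ASR = 0.6
print(attack_success_rate(["unsafe", "safe", "unsafe", "safe", "unsafe"]))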

108 stars. No commits in the last 6 months.

Use this if you need to systematically assess and benchmark the safety of various large language models against known harmful queries and red-teaming techniques.

Not ideal if you are looking for a simple, non-technical tool for general content moderation or for testing a single model without deep technical analysis.

Tags: AI Safety, LLM Evaluation, Red Teaming, Content Moderation, Harmful Content Detection
Flags: Stale (6 months), No Package, No Dependents
Score breakdown (the four components sum to the 39/100 overall score):
Maintenance: 0 / 25
Adoption: 9 / 25
Maturity: 16 / 25
Community: 14 / 25

Stars: 108
Forks: 13
Language: Python
License: Apache-2.0
Last pushed: Mar 08, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/declare-lab/red-instruct"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
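
For scripted use, the same endpoint can be called from Python. A minimal sketch, assuming the endpoint returns JSON and using the third-party requests library (not part of this project):

import requests

# Same endpoint as the curl example above.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/declare-lab/red-instruct"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail loudly on HTTP errors
data = response.json()       # assumes a JSON payload
print(data)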