declare-lab/red-instruct
Code and datasets for the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment".
This project evaluates how safely large language models (LLMs) respond to harmful questions, using different prompt templates (including the paper's chain-of-utterances prompts) to probe their safety guardrails. You supply a set of potentially harmful questions and specific prompt templates, and it generates and scores model responses. The result is the Attack Success Rate (ASR): the fraction of harmful prompts that elicit an unsafe answer, so a higher ASR means the model is easier to coax into unsafe behavior (a rough sketch of the metric follows the notes below). The tool is aimed at AI safety researchers and developers who need to rigorously test and improve the safety of their LLMs.
108 stars. No commits in the last 6 months.
Use this if you need to systematically assess and benchmark the safety of various large language models against known harmful queries and red-teaming techniques.
Not ideal if you are looking for a simple, non-technical tool for general content moderation or for testing a single model without deep technical analysis.
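As a rough illustration of the ASR metric described above, here is a minimal Python sketch. This is not the repository's own evaluation code: the judge function, the example responses, and the keyword check are illustrative assumptions, and the actual scoring in red-instruct may differ.

# Minimal sketch of the Attack Success Rate (ASR) metric: the fraction of
# model responses that a judge flags as unsafe. Illustrative only.
from typing import Callable, List

def attack_success_rate(responses: List[str],
                        is_unsafe: Callable[[str], bool]) -> float:
    """Return the fraction of responses judged unsafe (higher = weaker guardrails)."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if is_unsafe(r)) / len(responses)

# Toy judge based on a refusal prefix (an assumption; a real judge would be
# a model-based harmfulness classifier).
def naive_judge(text: str) -> bool:
    return not text.lower().startswith("i cannot")

print(attack_success_rate(
    ["Sure, here is how...", "I cannot help with that."],
    naive_judge))  # -> 0.5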
Stars: 108
Forks: 13
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 08, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/declare-lab/red-instruct"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
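If you prefer calling the endpoint from Python instead of curl, a minimal sketch follows; it assumes only that the endpoint returns JSON, and the available fields are whatever the API provides.

# Fetch the same repository-quality data shown by the curl command above.
# Assumes the endpoint returns JSON; field names are not guaranteed here.
import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/declare-lab/red-instruct"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(resp.json())  # inspect the returned fields (e.g. stars, forks, last pushed)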
Higher-rated alternatives
PaddlePaddle/PaddleNLP
Easy-to-use and powerful LLM and SLM library with awesome model zoo.
meta-llama/llama-cookbook
Welcome to the Llama Cookbook! This is your go-to guide for Building with Llama: Getting started...
arcee-ai/mergekit
Tools for merging pretrained large language models.
changyeyu/LLM-RL-Visualized
100+ original LLM/RL principle diagrams (100+ LLM/RL Algorithm Maps)
mindspore-lab/step_into_llm
MindSpore online courses: Step into LLM