xirui-li/DrAttack
Official implementation of paper: DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
This project helps security researchers and red teamers evaluate the robustness of large language models (LLMs) against adversarial prompts. It decomposes a potentially harmful prompt into sub-phrases, reconstructs them into a prompt with subtle changes, and searches over synonyms of the sub-phrases to build 'jailbreak' prompts. The output is an adversarial prompt designed to bypass LLM safety mechanisms.
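For orientation, here is a minimal sketch of that decompose / reconstruct / synonym-search loop. All names (`decompose_prompt`, `get_synonyms`, `query_llm`, `is_jailbroken`) are hypothetical placeholders chosen for illustration; they do not reflect DrAttack's actual code or API.

```python
# Minimal sketch of a decompose -> synonym-substitute -> reconstruct -> test loop.
# Every function name here is a hypothetical placeholder, not DrAttack's real API.
from itertools import product


def decompose_prompt(prompt: str) -> list[str]:
    """Split a harmful prompt into smaller, individually benign-looking sub-phrases."""
    # Placeholder split; the actual method uses a more principled decomposition.
    return prompt.split(", ")


def get_synonyms(phrase: str) -> list[str]:
    """Return candidate paraphrases for one sub-phrase (placeholder)."""
    return [phrase]  # a real implementation would draw from a synonym source


def search_jailbreak(prompt: str, query_llm, is_jailbroken) -> str | None:
    """Enumerate reconstructed prompts built from synonym-substituted sub-phrases."""
    sub_phrases = decompose_prompt(prompt)
    candidates = [get_synonyms(p) for p in sub_phrases]
    for combo in product(*candidates):          # try each synonym combination
        candidate_prompt = " ".join(combo)      # naive reconstruction of the prompt
        response = query_llm(candidate_prompt)  # query the target model
        if is_jailbroken(response):             # success criterion supplied by caller
            return candidate_prompt
    return None
```

The caller supplies `query_llm` (a function that sends a prompt to the target model) and `is_jailbroken` (a judge for whether the response complies), so the same loop can be pointed at any model under test.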
No commits in the last 6 months.
Use this if you are a security researcher or red teamer needing to rigorously test the safety alignment of LLMs like GPT-4, Gemini, or Llama2.
Not ideal if you are looking for a general-purpose LLM prompt-engineering tool, or for a way to bypass safety guidelines with malicious intent.
Stars: 66
Forks: 13
Language: JavaScript
License: MIT
Last pushed: Aug 25, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/xirui-li/DrAttack"
Open to everyone: 100 requests/day with no key needed. A free API key raises the limit to 1,000 requests/day.
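The same data can be fetched programmatically. Below is a minimal sketch using Python's requests library against the keyless tier; the response schema is not documented here, so the JSON is simply printed as-is.

```python
# Minimal sketch: fetch this repository's quality data from the public endpoint above.
# No API key is used (100 requests/day tier); the body is assumed to be JSON.
import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/xirui-li/DrAttack"
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # fail loudly on rate limits or server errors
print(resp.json())
```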
Related models
tmlr-group/DeepInception
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
UCSB-NLP-Chang/SemanticSmooth
Implementation of paper 'Defending Large Language Models against Jailbreak Attacks via Semantic...
sigeisler/reinforce-attacks-llms
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and...
DAMO-NLP-SG/multilingual-safety-for-LLMs
[ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models"
erfanshayegani/Jailbreak-In-Pieces
[ICLR 2024 Spotlight 🔥 ] - [ Best Paper Award SoCal NLP 2023 🏆] - Jailbreak in pieces:...