tmlr-group/DeepInception
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
This project helps AI safety researchers and red teamers evaluate the robustness of large language models (LLMs) against sophisticated manipulation. It takes a carefully constructed prompt that "hypnotizes" the LLM and reveals its hidden vulnerabilities, outputting a clear demonstration of how the LLM can be coaxed into generating harmful or restricted content. Anyone responsible for assessing and improving the safety of AI systems would use this tool.
172 stars. No commits in the last 6 months.
Use this if you need to systematically test and identify critical weaknesses in LLM safety guardrails by leveraging an LLM's personification abilities to create nested, deceptive scenarios.
Not ideal if you are looking for a simple, brute-force method to bypass LLM restrictions, as this tool focuses on a more advanced, context-aware approach.
Stars
172
Forks
20
Language
Python
License
MIT
Category
Last pushed
Feb 20, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/tmlr-group/DeepInception"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
xirui-li/DrAttack
Official implementation of paper: DrAttack: Prompt Decomposition and Reconstruction Makes...
UCSB-NLP-Chang/SemanticSmooth
Implementation of paper 'Defending Large Language Models against Jailbreak Attacks via Semantic...
sigeisler/reinforce-attacks-llms
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and...
DAMO-NLP-SG/multilingual-safety-for-LLMs
[ICLR 2024]Data for "Multilingual Jailbreak Challenges in Large Language Models"
erfanshayegani/Jailbreak-In-Pieces
[ICLR 2024 Spotlight 🔥 ] - [ Best Paper Award SoCal NLP 2023 🏆] - Jailbreak in pieces:...