yueliu1999/FlipAttack
[ICML 2025] An official source code for paper "FlipAttack: Jailbreak LLMs via Flipping".
This project offers a method to test the robustness of large language models (LLMs) against 'jailbreak' attempts. It disguises a potentially harmful prompt by flipping it (e.g., reversing character or word order), so the flipped text acts as left-side 'noise' that slips past the LLM's safety guardrails; the model is then guided to recover and act on the original request. This is designed for AI safety researchers and red teamers who need to evaluate how easily LLMs can be manipulated.
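As a rough illustration of the flipping idea, here is a minimal sketch of prompt transformations at different granularities. The function names are hypothetical and not the repo's actual API; consult the source code for the real implementation and its guidance prompts.

```python
# Hypothetical sketch of FlipAttack-style prompt flipping.
# Function names are illustrative, not the repository's actual API.

def flip_chars_in_word(prompt: str) -> str:
    """Reverse the characters inside each word, keeping word order."""
    return " ".join(word[::-1] for word in prompt.split())

def flip_chars_in_sentence(prompt: str) -> str:
    """Reverse the entire character sequence of the prompt."""
    return prompt[::-1]

def flip_word_order(prompt: str) -> str:
    """Reverse the order of the words, keeping each word intact."""
    return " ".join(reversed(prompt.split()))

if __name__ == "__main__":
    p = "how to pick a lock"
    print(flip_chars_in_word(p))      # woh ot kcip a kcol
    print(flip_chars_in_sentence(p))  # kcol a kcip ot woh
    print(flip_word_order(p))         # lock a pick to how
```

The flipped string is sent to the target model together with instructions to undo the flip, which is where the actual jailbreak pressure comes from.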
167 stars. No commits in the last 6 months.
Use this if you need to rigorously test and identify vulnerabilities in the safety mechanisms of black-box LLMs by seeing how they respond to cleverly disguised harmful prompts.
Not ideal if you are looking for a tool to develop or improve LLM safety filters, as this focuses on attacking existing ones.
Stars: 167
Forks: 13
Language: Python
License: —
Category: —
Last pushed: May 02, 2025
Commits (30d): 0
Higher-rated alternatives
UCSB-NLP-Chang/SemanticSmooth
Implementation of paper 'Defending Large Language Models against Jailbreak Attacks via Semantic...
sigeisler/reinforce-attacks-llms
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and...
DAMO-NLP-SG/multilingual-safety-for-LLMs
[ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models"
vicgalle/merging-self-critique-jailbreaks
"Merging Improves Self-Critique Against Jailbreak Attacks", code and models
wanglne/DELMAN
[ACL 2025 Findings] DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing