AetherPrior/TrickLLM
This repository contains the code for the paper "Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks" by Abhinav Rao, Sachin Vashishta*, Atharva Naik*, Somak Aditya, and Monojit Choudhury, accepted at LREC-CoLING 2024.
This tool helps AI safety researchers and red teamers understand how Large Language Models (LLMs) can be manipulated to produce unwanted or harmful content. It takes various 'jailbreak' prompts and base prompts as input, runs them against different LLMs (like GPT-based models, OPT, BLOOM, FLAN-T5-XXL), and provides detailed analysis and success rates of these attacks. The output helps in formalizing, analyzing, and detecting such deceptive behaviors.
No commits in the last 6 months.
Use this if you are an AI safety researcher or practitioner focused on understanding, evaluating, and mitigating prompt injection and 'jailbreaking' vulnerabilities in large language models.
Not ideal if you are looking for a simple, out-of-the-box content-moderation solution, or a way to apply fixes directly to a production LLM without first analyzing the attack vectors.
Stars: 8
Forks: 2
Language: Jupyter Notebook
License: AGPL-3.0
Last pushed: May 22, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/AetherPrior/TrickLLM"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
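The same request can be made from Python. A minimal sketch using only the standard library, assuming the endpoint returns JSON (the response schema is not documented on this page, so treat the parsed result's fields as unknown):

```python
# Hypothetical Python equivalent of the curl command above. The endpoint
# path comes from this page; the JSON response schema is an assumption.
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def build_url(owner: str, repo: str) -> str:
    """Build the quality-API URL for a given GitHub repository."""
    return f"{API_BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record as parsed JSON (keyless tier: 100 req/day)."""
    with urllib.request.urlopen(build_url(owner, repo)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Prints the URL queried; call fetch_quality() to retrieve the record.
    print(build_url("AetherPrior", "TrickLLM"))
```

For the higher 1,000 requests/day tier, the key would presumably be sent with the request (header or query parameter), but the exact mechanism is not specified here.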
Higher-rated alternatives
wuyoscar/ISC-Bench
Internal Safety Collapse: Turning LLMs into a "Jailbroken State" Without "a Jailbreak Attack".
yueliu1999/Awesome-Jailbreak-on-LLMs
Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods...
yiksiu-chan/SpeakEasy
[ICML 2025] Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions
xirui-li/DrAttack
Official implementation of paper: DrAttack: Prompt Decomposition and Reconstruction Makes...
tmlr-group/DeepInception
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"