AetherPrior/TrickLLM

This repository contains the code for the paper "Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks" by Abhinav Rao, Sachin Vashishta*, Atharva Naik*, Somak Aditya, and Monojit Choudhury, accepted at LREC-CoLING 2024

33
/ 100
Emerging

This tool helps AI safety researchers and red teamers understand how Large Language Models (LLMs) can be manipulated to produce unwanted or harmful content. It takes various 'jailbreak' prompts and base prompts as input, runs them against different LLMs (like GPT-based models, OPT, BLOOM, FLAN-T5-XXL), and provides detailed analysis and success rates of these attacks. The output helps in formalizing, analyzing, and detecting such deceptive behaviors.

No commits in the last 6 months.

Use this if you are an AI safety researcher or practitioner focused on understanding, evaluating, and mitigating prompt injection and 'jailbreaking' vulnerabilities in large language models.

Not ideal if you are looking for a simple, out-of-the-box solution for content moderation or directly applying fixes to a production LLM without deep analysis of attack vectors.

AI Safety LLM Vulnerabilities Red Teaming Prompt Engineering Content Moderation Research
Stale 6m No Package No Dependents
Maintenance 0 / 25
Adoption 4 / 25
Maturity 16 / 25
Community 13 / 25

How are scores calculated?

Stars

8

Forks

2

Language

Jupyter Notebook

License

AGPL-3.0

Last pushed

May 22, 2024

Commits (30d)

0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/AetherPrior/TrickLLM"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.