SecHack365-Fans/prompt2slip
This library tests the ethics of language models using natural adversarial texts.
This library helps evaluate the ethical boundaries and potential risks of large language models by finding prompts that cause them to generate specific, often undesirable, target words. Given a language model and a list of target words, it produces adversarial text that forces the model to include those words in its output. This is useful for AI safety researchers, ethics auditors, and machine learning engineers responsible for deploying language AI responsibly.
No commits in the last 6 months.
Use this if you need to systematically test how easily a language model can be manipulated into producing specific, potentially harmful, or off-topic words or phrases through adversarial prompting.
Not ideal if you are looking for a general-purpose tool to improve language model performance or fine-tune models for specific tasks.
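To illustrate the general idea described above, here is a toy sketch of adversarial prompt search. This is NOT prompt2slip's actual API or algorithm; the stand-in model, the scoring function, and the random-search loop are all assumptions made only to show the shape of "find a prompt that makes the model emit a target word".

```python
# Toy sketch of adversarial prompt search (not prompt2slip's real API).
# A stand-in "model" and a random-search loop illustrate the general idea:
# mutate a prompt and keep variants that make target words more likely.
import random


def toy_model(prompt: str) -> str:
    """Stand-in for a language model: simply echoes the prompt back."""
    return prompt


def score(output: str, targets: list[str]) -> int:
    """Count how many target words appear in the model output."""
    words = output.lower().split()
    return sum(t in words for t in targets)


def random_search(prompt: str, targets: list[str], vocab: list[str],
                  steps: int = 200, seed: int = 0) -> str:
    """Randomly substitute prompt words, keeping changes that raise the score."""
    rng = random.Random(seed)
    best = prompt.split()
    best_score = score(toy_model(" ".join(best)), targets)
    for _ in range(steps):
        cand = best[:]
        cand[rng.randrange(len(cand))] = rng.choice(vocab)
        s = score(toy_model(" ".join(cand)), targets)
        if s > best_score:
            best, best_score = cand, s
    return " ".join(best)


adv = random_search("please summarize this text", ["slip"],
                    vocab=["slip", "the", "a", "model"])
print(adv)
```

A real attack would replace `toy_model` with an actual language model and `score` with a likelihood of the target words under that model; the search loop is the only part that carries over conceptually.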
Stars: 9
Forks: 2
Language: Python
License: MIT
Category:
Last pushed: Dec 04, 2021
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/SecHack365-Fans/prompt2slip"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
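The same endpoint can be queried from Python; a minimal sketch, assuming only the URL pattern shown in the curl command above (the response schema is not documented here, so the result is treated as opaque JSON):

```python
# Query the quality endpoint shown above from Python.
# Only the URL pattern comes from this page; the response fields are unknown,
# so the payload is returned as an opaque parsed-JSON object.
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record (100 requests/day without a key)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


print(quality_url("SecHack365-Fans", "prompt2slip"))
```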
Higher-rated alternatives
thunlp/OpenAttack
An Open-Source Package for Textual Adversarial Attack.
thunlp/TAADpapers
Must-read Papers on Textual Adversarial Attack and Defense
jind11/TextFooler
A Model for Natural Language Attack on Text Classification and Inference
thunlp/OpenBackdoor
An open-source toolkit for textual backdoor attack and defense (NeurIPS 2022 D&B, Spotlight)
thunlp/SememePSO-Attack
Code and data of the ACL 2020 paper "Word-level Textual Adversarial Attacking as Combinatorial...