LLM-Tuning-Safety/LLMs-Finetuning-Safety

We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20 via OpenAI’s APIs.

Overall score: 42 / 100 (Emerging)

This project helps developers and researchers understand how fine-tuning large language models (LLMs) such as GPT-3.5 Turbo can unintentionally reduce their safety. It demonstrates that even a small, seemingly innocuous fine-tuning dataset can lead a model to generate harmful content. The project takes various fine-tuning datasets as input and reports how much the harmfulness scores of the resulting models increase.
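For context, the safety-degradation effect the project documents arises through OpenAI's standard fine-tuning workflow. Below is a minimal sketch of that workflow, assuming the openai Python SDK v1.x and a hypothetical training_data.jsonl of chat-formatted examples; it is illustrative only, not the project's own code.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a small JSONL dataset of {"messages": [...]} chat examples
# (hypothetical file name, not part of this repository).
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job on GPT-3.5 Turbo with that dataset.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)

The project's finding is that even when such a dataset looks benign, the resulting fine-tuned model can score measurably higher on harmfulness evaluations.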

344 stars. No commits in the last 6 months.

Use this if you are a developer or researcher working with LLMs and need to understand the potential safety risks and unintended consequences of fine-tuning, even with benign data.

Not ideal if you are looking for a direct solution to fix or prevent LLM safety degradation, as this project focuses on demonstrating and analyzing the problem.

Tags: AI Safety Research, LLM Development, Model Fine-tuning, Responsible AI, AI Security
Stale (6m) · No Package · No Dependents
Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 16 / 25

Stars: 344
Forks: 35
Language: Python
License: MIT
Last pushed: Feb 23, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/LLM-Tuning-Safety/LLMs-Finetuning-Safety"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
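The same lookup can be scripted; below is a minimal Python sketch using the requests library. The response schema is not documented here, so the JSON is printed as-is rather than assuming particular field names.

import requests

# Endpoint taken from the curl example above; assumed to return JSON.
url = ("https://pt-edge.onrender.com/api/v1/quality/"
       "llm-tools/LLM-Tuning-Safety/LLMs-Finetuning-Safety")
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly on rate limiting or server errors
print(response.json())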