LLM-Tuning-Safety/LLMs-Finetuning-Safety
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20 via OpenAI’s APIs.
This project helps developers and researchers understand how fine-tuning large language models (LLMs) such as GPT-3.5 Turbo can unintentionally degrade their safety. It demonstrates that even a small, seemingly innocuous fine-tuning dataset can lead models to generate harmful content. Given various fine-tuning datasets as input, the project reports how much the harmfulness scores of the resulting models increase.
344 stars. No commits in the last 6 months.
Use this if you are a developer or researcher working with LLMs and need to understand the potential safety risks and unintended consequences of fine-tuning, even with benign data.
Not ideal if you are looking for a direct solution to fix or prevent LLM safety degradation, as this project focuses on demonstrating and analyzing the problem.
Stars: 344
Forks: 35
Language: Python
License: MIT
Category:
Last pushed: Feb 23, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/LLM-Tuning-Safety/LLMs-Finetuning-Safety"
Open to everyone: 100 requests/day with no key required. Get a free key for 1,000 requests/day.
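The curl command above can also be issued from Python. Below is a minimal sketch using only the standard library; the `quality_url` and `fetch_quality` helper names are illustrative, and the response's JSON field names are not documented here, so inspect the returned dict rather than assuming a schema.

```python
import json
import urllib.request

# Base endpoint taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record for a repo and parse it as JSON.

    The free tier allows 100 requests/day without a key, so cache
    results rather than polling in a loop.
    """
    with urllib.request.urlopen(quality_url(owner, repo), timeout=10) as resp:
        return json.load(resp)


print(quality_url("LLM-Tuning-Safety", "LLMs-Finetuning-Safety"))
```

Calling `fetch_quality("LLM-Tuning-Safety", "LLMs-Finetuning-Safety")` performs the same request as the curl example.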
Related tools
kyegomez/Sophia
Effortless plug-and-play optimizer that cuts model training costs by 50%. New optimizer that is...
appier-research/robust-llm-finetunes
Accepted to NeurIPS 2025
uthmandevsec/Self-Distillation
🤖 Enable continual learning by reproducing the On-Policy Self-Distillation algorithm for robust...
jmcentire/apprentice
Train cheap models on expensive ones. Automatically. With receipts.
phonism/LLMNotes
LLM study notes: Transformer architecture, reinforcement learning (RLHF/DPO/PPO), distributed training, and inference optimization. Includes complete mathematical derivations and slides.