LLM-Tuning-Safety/LLMs-Finetuning-Safety
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20 via OpenAI’s APIs.
This project helps developers and researchers understand how fine-tuning large language models (LLMs) such as GPT-3.5 Turbo can unintentionally degrade their safety. It demonstrates that even a small, seemingly innocuous fine-tuning dataset can lead models to generate harmful content. Given various fine-tuning datasets as input, the project reports how much the harmfulness scores of the resulting models increase.
344 stars. No commits in the last 6 months.
Use this if you are a developer or researcher working with LLMs and need to understand the potential safety risks and unintended consequences of fine-tuning, even with benign data.
Not ideal if you are looking for a direct solution to fix or prevent LLM safety degradation, as this project focuses on demonstrating and analyzing the problem.
Stars: 344
Forks: 35
Language: Python
License: MIT
Category:
Last pushed: Feb 23, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/LLM-Tuning-Safety/LLMs-Finetuning-Safety"
Open to everyone: 100 requests/day with no key required. Get a free key for 1,000 requests/day.
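The curl command above can also be issued from Python. Below is a minimal sketch using only the standard library; the `quality_url` and `fetch_quality` helper names are illustrative, and the response's JSON field names are not documented here, so inspect the returned dict rather than assuming a schema.

```python
import json
import urllib.request

# Base endpoint taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record for a repo and parse it as JSON.

    The free tier allows 100 requests/day without a key, so cache
    results rather than polling in a loop.
    """
    with urllib.request.urlopen(quality_url(owner, repo), timeout=10) as resp:
        return json.load(resp)


print(quality_url("LLM-Tuning-Safety", "LLMs-Finetuning-Safety"))
```

Calling `fetch_quality("LLM-Tuning-Safety", "LLMs-Finetuning-Safety")` performs the same request as the curl example.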
Related tools
kyegomez/Sophia
Effortless plug-and-play optimizer that cuts model training costs by 50%. New optimizer that is...
appier-research/robust-llm-finetunes
Accepted to NeurIPS 2025
uthmandevsec/Self-Distillation
🤖 Enable continual learning by reproducing the On-Policy Self-Distillation algorithm for robust...
jmcentire/apprentice
Train cheap models on expensive ones. Automatically. With receipts.
phonism/LLMNotes
LLM study notes: Transformer architecture, reinforcement learning (RLHF/DPO/PPO), distributed training, and inference optimization. Includes complete mathematical derivations and slides.