PKU-Alignment/llms-resist-alignment
[ACL2025 Best Paper] Language Models Resist Alignment
This research explores why large language models (LLMs) often revert to their pre-training behaviors even after being fine-tuned toward desirable outcomes such as safety or helpfulness. It investigates the 'elasticity' of LLMs, whereby alignment can be superficial and easily undone, allowing harmful or unintended behaviors to resurface. The project offers insights and experimental evidence for AI researchers, ethical AI developers, and anyone building or deploying LLMs in settings where robust and lasting behavioral control is critical.
No commits in the last 6 months.
Use this if you are a researcher or practitioner developing or fine-tuning large language models and need to understand the fundamental mechanisms that make alignment efforts fragile and potentially reversible.
Not ideal if you are looking for ready-to-use tools or code to directly implement or improve model alignment in production systems without deep theoretical exploration.
Stars
44
Forks
1
Language
Python
License
—
Category
Last pushed
Jun 11, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/PKU-Alignment/llms-resist-alignment"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
agentscope-ai/Trinity-RFT
Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement...
OpenRLHF/OpenRLHF
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO &...
zjunlp/EasyEdit
[ACL 2024] An Easy-to-use Knowledge Editing Framework for LLMs.
huggingface/alignment-handbook
Robust recipes to align language models with human and AI preferences
hyunwoongko/nanoRLHF
nanoRLHF: from-scratch journey into how LLMs and RLHF really work.