PKU-Alignment/llms-resist-alignment
[ACL2025 Best Paper] Language Models Resist Alignment
This research explores why large language models (LLMs) often revert to their pre-training behaviors even after being fine-tuned toward desirable outcomes such as safety or helpfulness. It investigates the 'elasticity' of LLMs, whereby alignment can be superficial and easily undone, allowing harmful or unintended behaviors to resurface. The project offers insights and experimental evidence for AI researchers, ethical AI developers, and anyone building or deploying LLMs in settings where robust and lasting behavioral control is critical.
No commits in the last 6 months.
Use this if you are a researcher or practitioner developing or fine-tuning large language models and need to understand the fundamental mechanisms that make alignment efforts fragile and potentially reversible.
Not ideal if you are looking for ready-to-use tools or code to directly implement or improve model alignment in production systems without deep theoretical exploration.
Stars
44
Forks
1
Language
Python
License
—
Category
Last pushed
Jun 11, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/PKU-Alignment/llms-resist-alignment"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
agentscope-ai/Trinity-RFT
Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement...
OpenRLHF/OpenRLHF
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO &...
zjunlp/EasyEdit
[ACL 2024] An Easy-to-use Knowledge Editing Framework for LLMs.
huggingface/alignment-handbook
Robust recipes to align language models with human and AI preferences
hyunwoongko/nanoRLHF
nanoRLHF: from-scratch journey into how LLMs and RLHF really work.