PKU-Alignment/llms-resist-alignment

[ACL 2025 Best Paper] Language Models Resist Alignment

Score: 21 / 100 (Experimental)

This research examines why large language models (LLMs) often revert to their pre-training behavior even after being fine-tuned for desirable outcomes such as safety or helpfulness. It investigates the 'elasticity' of LLMs: alignment can be superficial and easily undone, allowing harmful or unintended behaviors to resurface. The project offers insights and experimental evidence for AI researchers, ethical-AI developers, and anyone building or deploying LLMs where robust, lasting behavioral control is critical.

No commits in the last 6 months.

Use this if you are a researcher or practitioner developing or fine-tuning large language models and need to understand the fundamental mechanisms that make alignment efforts fragile and potentially reversible.

Not ideal if you are looking for ready-to-use tools or code to directly implement or improve model alignment in production systems without deep theoretical exploration.

AI-safety LLM-alignment model-robustness AI-ethics generative-AI-research
No License · Stale (6m) · No Package · No Dependents
Maintenance: 2 / 25
Adoption: 8 / 25
Maturity: 8 / 25
Community: 3 / 25
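
The overall score appears to be the sum of the four 25-point subscores. A minimal Python check of that arithmetic (the summing formula is an assumption, not documented on this page):

subscores = {"Maintenance": 2, "Adoption": 8, "Maturity": 8, "Community": 3}
total = sum(subscores.values())  # 2 + 8 + 8 + 3 = 21
print(f"{total} / 100")  # prints "21 / 100", matching the headline score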

Stars: 44
Forks: 1
Language: Python
License: None
Last pushed: Jun 11, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/PKU-Alignment/llms-resist-alignment"

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
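
For programmatic use, here is a minimal Python sketch with the requests library. It assumes only the endpoint from the curl example above; the JSON field names are not documented on this page, so the raw payload is printed as-is.

import requests

# Fetch the same quality report shown above (endpoint taken from the
# curl example; the response schema is undocumented, so print it raw).
url = ("https://pt-edge.onrender.com/api/v1/quality/"
       "transformers/PKU-Alignment/llms-resist-alignment")
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # a 429 here would mean the 100 requests/day limit was hit
print(resp.json())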