boyiwei/alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
This tool helps AI safety researchers and model developers evaluate the robustness of safety features in large language models like Llama 2. It takes a pre-trained, safety-aligned LLM and a dataset of safety-critical prompts as input. The output helps users understand how easily a model's safety alignment can be degraded by making small, targeted modifications to its internal structure.
No commits in the last 6 months.
Use this if you need to rigorously test the brittleness of safety alignment in your large language models by analyzing how pruning or low-rank modifications affect their safety performance and general utility.
Not ideal if you are looking for a general-purpose model pruning tool to optimize inference speed or reduce model size, without a primary focus on evaluating safety alignment.
Stars: 89
Forks: 17
Language: Python
License: MIT
Last pushed: Mar 30, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/boyiwei/alignment-attribution-code"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
steering-vectors/steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
jianghoucheng/AlphaEdit
AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models, ICLR 2025 (Outstanding Paper)
kmeng01/memit
Mass-editing thousands of facts into a transformer memory (ICLR 2023)
jianghoucheng/AnyEdit
AnyEdit: Edit Any Knowledge Encoded in Language Models, ICML 2025
zjunlp/KnowledgeCircuits
[NeurIPS 2024] Knowledge Circuits in Pretrained Transformers