boyiwei/alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
This tool helps AI safety researchers and model developers evaluate the robustness of safety features in large language models like Llama 2. It takes a pre-trained, safety-aligned LLM and a dataset of safety-critical prompts as input. The output helps users understand how easily a model's safety alignment can be degraded by making small, targeted modifications to its internal structure.
No commits in the last 6 months.
Use this if you need to rigorously test the brittleness of safety alignment in your large language models by analyzing how pruning or low-rank modifications affect their safety performance and general utility.
Not ideal if you are looking for a general-purpose model pruning tool to optimize inference speed or reduce model size, without a primary focus on evaluating safety alignment.
Stars: 89
Forks: 17
Language: Python
License: MIT
Last pushed: Mar 30, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/boyiwei/alignment-attribution-code"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
steering-vectors/steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
jianghoucheng/AlphaEdit
AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models, ICLR 2025 (Outstanding Paper)
kmeng01/memit
Mass-editing thousands of facts into a transformer memory (ICLR 2023)
jianghoucheng/AnyEdit
AnyEdit: Edit Any Knowledge Encoded in Language Models, ICML 2025
zjunlp/KnowledgeCircuits
[NeurIPS 2024] Knowledge Circuits in Pretrained Transformers