mintaywon/IF_RLHF
Source code for 'Understanding impacts of human feedback via influence functions'
This project evaluates how individual human feedback examples influence the behavior and biases of large language models, particularly reward models. Given a dataset of human feedback (e.g., preference rankings), it identifies which feedback instances contribute most to biases such as favoring longer responses or sycophantic replies. It is aimed at AI researchers and model developers who want to understand and mitigate bias in their systems.
No commits in the last 6 months.
Use this if you need to pinpoint exactly which pieces of human feedback are causing your reward model to develop undesirable biases.
Not ideal if you are looking for a plug-and-play solution to automatically de-bias a live AI system or if you are not working with large language models and human feedback data.
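The core idea, attributing a test-time bias to individual training preferences via influence functions, can be sketched in a minimal, self-contained way. This is an illustrative toy for a linear Bradley-Terry reward model with synthetic data, not the repository's actual code or API; all names here are hypothetical.

```python
import numpy as np

def bt_grad(w, x_pos, x_neg):
    """Gradient of the Bradley-Terry loss -log sigmoid(w . (x_pos - x_neg))."""
    d = x_pos - x_neg
    p = 1.0 / (1.0 + np.exp(-w @ d))
    return -(1.0 - p) * d

def bt_hessian(w, X_pos, X_neg, damping=1e-3):
    """Damped Hessian of the average Bradley-Terry loss over all pairs."""
    D = X_pos - X_neg
    p = 1.0 / (1.0 + np.exp(-D @ w))
    H = (D.T * (p * (1.0 - p))) @ D / len(D)
    return H + damping * np.eye(len(w))

def influence_scores(w, X_pos, X_neg, x_test_pos, x_test_neg):
    """Classic influence function: I(z_train, z_test) =
    -grad_test^T H^{-1} grad_train, one score per training pair."""
    H_inv = np.linalg.inv(bt_hessian(w, X_pos, X_neg))
    g_test = bt_grad(w, x_test_pos, x_test_neg)
    return np.array([
        -g_test @ H_inv @ bt_grad(w, xp, xn)
        for xp, xn in zip(X_pos, X_neg)
    ])

# Synthetic feature vectors for 8 preference pairs (chosen vs. rejected).
rng = np.random.default_rng(0)
X_pos = rng.normal(size=(8, 4))
X_neg = rng.normal(size=(8, 4))
w = rng.normal(size=4)

# Score how much each training pair influences the loss on a probe pair.
scores = influence_scores(w, X_pos, X_neg, X_pos[0], X_neg[0])
print(scores.shape)  # one influence score per training pair
```

In practice the Hessian of a real reward model is far too large to invert directly; the paper's setting relies on approximations (e.g., stochastic Hessian-vector products), but the scoring logic above is the same: large-magnitude scores flag the feedback pairs most responsible for a measured bias.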
Stars
10
Forks
1
Language
Python
License
MIT
Last pushed
Feb 02, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/mintaywon/IF_RLHF"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
agentscope-ai/Trinity-RFT
Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement...
OpenRLHF/OpenRLHF
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO &...
zjunlp/EasyEdit
[ACL 2024] An Easy-to-use Knowledge Editing Framework for LLMs.
huggingface/alignment-handbook
Robust recipes to align language models with human and AI preferences
hyunwoongko/nanoRLHF
nanoRLHF: from-scratch journey into how LLMs and RLHF really work.