mintaywon/IF_RLHF
Source code for 'Understanding impacts of human feedback via influence functions'
This project evaluates how individual human feedback examples influence the behavior and biases of large language models, particularly reward models. Given a dataset of human feedback (e.g., preference rankings), it identifies which feedback instances contribute most to biases such as favoring longer responses or sycophantic replies. It is aimed at AI researchers and model developers who want to understand and mitigate bias in their systems.
No commits in the last 6 months.
Use this if you need to pinpoint exactly which pieces of human feedback are causing your reward model to develop undesirable biases.
Not ideal if you are looking for a plug-and-play solution to automatically de-bias a live AI system or if you are not working with large language models and human feedback data.
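The core idea, attributing a test-time bias to individual training preferences via influence functions, can be sketched in a minimal, self-contained way. This is an illustrative toy for a linear Bradley-Terry reward model with synthetic data, not the repository's actual code or API; all names here are hypothetical.

```python
import numpy as np

def bt_grad(w, x_pos, x_neg):
    """Gradient of the Bradley-Terry loss -log sigmoid(w . (x_pos - x_neg))."""
    d = x_pos - x_neg
    p = 1.0 / (1.0 + np.exp(-w @ d))
    return -(1.0 - p) * d

def bt_hessian(w, X_pos, X_neg, damping=1e-3):
    """Damped Hessian of the average Bradley-Terry loss over all pairs."""
    D = X_pos - X_neg
    p = 1.0 / (1.0 + np.exp(-D @ w))
    H = (D.T * (p * (1.0 - p))) @ D / len(D)
    return H + damping * np.eye(len(w))

def influence_scores(w, X_pos, X_neg, x_test_pos, x_test_neg):
    """Classic influence function: I(z_train, z_test) =
    -grad_test^T H^{-1} grad_train, one score per training pair."""
    H_inv = np.linalg.inv(bt_hessian(w, X_pos, X_neg))
    g_test = bt_grad(w, x_test_pos, x_test_neg)
    return np.array([
        -g_test @ H_inv @ bt_grad(w, xp, xn)
        for xp, xn in zip(X_pos, X_neg)
    ])

# Synthetic feature vectors for 8 preference pairs (chosen vs. rejected).
rng = np.random.default_rng(0)
X_pos = rng.normal(size=(8, 4))
X_neg = rng.normal(size=(8, 4))
w = rng.normal(size=4)

# Score how much each training pair influences the loss on a probe pair.
scores = influence_scores(w, X_pos, X_neg, X_pos[0], X_neg[0])
print(scores.shape)  # one influence score per training pair
```

In practice the Hessian of a real reward model is far too large to invert directly; the paper's setting relies on approximations (e.g., stochastic Hessian-vector products), but the scoring logic above is the same: large-magnitude scores flag the feedback pairs most responsible for a measured bias.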
Stars
10
Forks
1
Language
Python
License
MIT
Last pushed
Feb 02, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/mintaywon/IF_RLHF"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
agentscope-ai/Trinity-RFT
Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement...
OpenRLHF/OpenRLHF
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO &...
zjunlp/EasyEdit
[ACL 2024] An Easy-to-use Knowledge Editing Framework for LLMs.
huggingface/alignment-handbook
Robust recipes to align language models with human and AI preferences
hyunwoongko/nanoRLHF
nanoRLHF: from-scratch journey into how LLMs and RLHF really work.