mintaywon/IF_RLHF

Source code for 'Understanding impacts of human feedback via influence functions'

Score: 28 / 100 (Experimental)

This project helps evaluate how specific human feedback data influences the behavior and biases of large language models, particularly reward models. It takes human-feedback datasets (e.g., preference rankings) and identifies which feedback instances contribute most to biases such as favoring longer responses or sycophantic replies. It is aimed at AI researchers and model developers who want to understand and mitigate biases in their systems.

No commits in the last 6 months.

Use this if you need to pinpoint exactly which pieces of human feedback are causing your reward model to develop undesirable biases.

Not ideal if you are looking for a plug-and-play solution to automatically de-bias a live AI system or if you are not working with large language models and human feedback data.
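The core technique the repo builds on, influence functions, scores each training example by how much it pushes a model toward some test-time quantity (here, a bias metric). A minimal illustrative sketch using a linear Bradley-Terry reward model is below. This is not the repo's actual code; the feature setup, the damping term, and the "bias direction" test gradient are all simplified assumptions for illustration.

```python
# Illustrative influence-function scoring for preference data.
# Assumes a linear Bradley-Terry reward model r(x) = w . x; NOT the repo's code.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
x_chosen = rng.normal(size=(n, d))    # features of preferred responses
x_rejected = rng.normal(size=(n, d))  # features of rejected responses
w = rng.normal(size=d) * 0.1          # pretend these weights are already trained

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per-pair Bradley-Terry loss: -log sigmoid(w . (x_chosen - x_rejected))
diff = x_chosen - x_rejected          # (n, d)
p = sigmoid(diff @ w)                 # P(chosen preferred over rejected)

# Per-example gradient of the loss w.r.t. w: -(1 - p_i) * diff_i
grads = -(1.0 - p)[:, None] * diff    # (n, d)

# Gauss-Newton approximation of the Hessian, with damping for stability.
H = (diff.T * (p * (1.0 - p))) @ diff / n + 1e-3 * np.eye(d)

# Hypothetical "bias metric" gradient, e.g. the direction in weight space
# that increases reward for longer responses.
g_test = rng.normal(size=d)

# Influence of example i on the bias metric: -g_test^T H^{-1} grad_i
influence = -(grads @ np.linalg.solve(H, g_test)) / n
top = np.argsort(influence)[-5:]      # pairs that most increase the bias
print(influence.shape, top)
```

Sorting by these scores is what lets you surface the handful of preference pairs most responsible for a given bias, which is exactly the use case described above.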

AI bias detection · Large Language Model evaluation · Reinforcement Learning from Human Feedback · Machine Learning interpretability · AI ethics
Stale (6m) · No Package · No Dependents
Maintenance 0 / 25
Adoption 5 / 25
Maturity 16 / 25
Community 7 / 25


Stars: 10
Forks: 1
Language: Python
License: MIT
Last pushed: Feb 02, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/mintaywon/IF_RLHF"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.