yyy01/PAC
The official implementation of the paper "Data Contamination Calibration for Black-box LLMs" (ACL 2024)
This project helps AI researchers and practitioners determine whether specific data was included in a Large Language Model's (LLM) training set. You provide a dataset of text snippets, and it reports which ones were likely part of the model's training data, i.e., which ones contaminated the LLM. It is aimed at anyone working to ensure the integrity and privacy of LLMs.
No commits in the last 6 months.
Use this if you need to detect data contamination in black-box or white-box Large Language Models, that is, to verify whether specific text data was used in their training.
Not ideal if you are looking for a general-purpose data cleaning tool unrelated to LLM training data integrity.
Stars
16
Forks
1
Language
Python
License
MIT
Last pushed
May 21, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/yyy01/PAC"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
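The curl command above can also be issued from Python. A minimal sketch, assuming the endpoint returns JSON (the URL pattern is taken from the curl example; the helper names and the response's field names are my own, not documented here):

```python
"""Sketch of querying the repo-quality API shown above."""
import json
import urllib.request

# Base URL taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the API URL for a repository, mirroring the curl example."""
    return f"{API_BASE}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str) -> dict:
    """Fetch and decode the JSON payload (requires network access)."""
    with urllib.request.urlopen(quality_url(category, owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Prints the same URL the curl example requests.
    print(quality_url("transformers", "yyy01", "PAC"))
```

The URL-building step is separated from the network call so the request target can be inspected (or rate limits respected) before fetching.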
Higher-rated alternatives
MadryLab/context-cite
Attribute (or cite) statements generated by LLMs back to in-context information.
microsoft/augmented-interpretable-models
Interpretable and efficient predictors using pre-trained language models. Scikit-learn compatible.
Trustworthy-ML-Lab/CB-LLMs
[ICLR 25] A novel framework for building intrinsically interpretable LLMs with...
poloclub/LLM-Attributor
LLM Attributor: Attribute LLM's Generated Text to Training Data
THUDM/LongCite
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA