shahriargolchin/time-travel-in-llms
The official repository for the paper entitled "Time Travel in LLMs: Tracing Data Contamination in Large Language Models."
This tool helps AI researchers and developers determine whether a large language model (LLM) was trained on specific text data, a problem known as data contamination. It takes a segment of text and an LLM, then checks whether the LLM can complete the text by replicating the original instance. The output indicates whether the model shows signs of having seen that particular data during training.
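The completion check described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names, the 50% split point, the 0.9 threshold, and the use of `difflib` as a similarity measure are all assumptions made for the example, and the `complete` callable stands in for whatever LLM client is used.

```python
from difflib import SequenceMatcher
from typing import Callable


def split_instance(text: str, prefix_fraction: float = 0.5) -> tuple[str, str]:
    """Split an instance into a prefix shown to the model and a held-back remainder.

    The split point (prefix_fraction) is an illustrative assumption.
    """
    words = text.split()
    if len(words) < 2:
        raise ValueError("need at least two words to split an instance")
    cut = max(1, int(len(words) * prefix_fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])


def completion_overlap(completion: str, reference: str) -> float:
    """Word-level similarity in [0, 1].

    difflib is a stand-in here; it is not claimed to be the tool's metric.
    """
    return SequenceMatcher(None, completion.split(), reference.split()).ratio()


def looks_contaminated(
    text: str,
    complete: Callable[[str], str],  # hypothetical wrapper around an LLM call
    threshold: float = 0.9,          # assumed cutoff, chosen for illustration
) -> bool:
    """Return True if the model's completion nearly reproduces the held-back text."""
    prefix, reference = split_instance(text)
    completion = complete(prefix)
    return completion_overlap(completion, reference) >= threshold
```

A model that returns the held-back remainder verbatim scores 1.0 and is flagged, while an unrelated completion scores near 0. In practice the threshold and the similarity measure would need to be calibrated against instances the model is known not to have seen.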
No commits in the last 6 months.
Use this if you need to determine whether specific data was part of the training set of a black-box large language model.
Not ideal if you need to estimate the *amount* of contamination or attribute contamination to specific sources, as this tool focuses on detection.
Stars: 12
Forks: 4
Language: Python
License: Apache-2.0
Category:
Last pushed: Jun 11, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/shahriargolchin/time-travel-in-llms"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
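The same request can be made from Python with the standard library. This is a sketch under assumptions: the helper names are hypothetical, calling the `transformers` path segment a "category" is a guess based on the URL above, and the response is returned as raw text because the page does not describe its format.

```python
import urllib.request
from urllib.parse import quote

# Base path taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL; `repo` is "owner/name" and keeps its slash."""
    return f"{BASE_URL}/{quote(category)}/{quote(repo)}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the quality record and return the response body as text.

    The body is not parsed because its format is not documented here.
    """
    url = build_quality_url(category, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_quality("transformers", "shahriargolchin/time-travel-in-llms")` issues the same GET request as the curl command. Unauthenticated requests count against the 100 requests/day limit.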
Higher-rated alternatives
mlabonne/llm-datasets
Curated list of datasets and tools for post-training.
malteos/llm-datasets
A collection of datasets for language model pretraining including scripts for downloading,...
magpie-align/magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your...
jd-coderepos/llms4subjects
The official SemEval 2025 Task 5 - LLMs4Subjects - Shared Task Dataset repository
willxxy/ECG-Bench
A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs)