lyy1994/awesome-data-contamination

The Paper List on Data Contamination for Large Language Models Evaluation.

Quality score: 42 / 100 (Emerging)

This resource helps researchers and practitioners evaluate Large Language Models (LLMs) accurately by addressing the problem of "data contamination." It provides a curated list of research papers that analyze, prevent, or detect instances where LLMs might have inadvertently seen evaluation data during their training. Users can consult this list to understand how to ensure their LLM benchmarks reflect true model capabilities, not just memorization.


Use this if you are a researcher, data scientist, or engineer developing or evaluating large language models and need to understand, detect, or prevent data contamination that can skew performance metrics.

Not ideal if you are looking for a general introduction to LLMs or seeking pre-trained models, as this resource focuses specifically on the technical issue of data contamination in evaluation.

Tags: LLM evaluation, model benchmarking, data quality, AI ethics, machine learning, research
No package · No dependents
Maintenance: 10 / 25
Adoption: 9 / 25
Maturity: 16 / 25
Community: 7 / 25


Stars: 110
Forks: 5
Language: (not listed)
License: MIT
Last pushed: Jan 29, 2026
Commits (30d): 0

Get this data via API:

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/lyy1994/awesome-data-contamination"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
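For scripted access, the same endpoint can be called from Python. A minimal sketch, assuming the endpoint returns a JSON body (the response schema is not documented on this page, so no specific fields are relied upon):

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def quality_url(owner: str, repo: str) -> str:
    """Build the quality-data endpoint URL for a given GitHub repository."""
    return f"{API_BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record for a repository.

    Assumes the response is JSON; the exact schema is not documented here,
    so the result is returned as a plain dict for the caller to inspect.
    """
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

# Example (performs a network request):
# data = fetch_quality("lyy1994", "awesome-data-contamination")
# print(sorted(data.keys()))
```

The anonymous tier (100 requests/day) needs no credentials; how a free key for the 1,000/day tier is attached to requests is not specified on this page, so the sketch omits it.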