nlx-group/overlapy

A Python package for measuring textual overlap (shared N-grams) between two bodies of text.

Score: 43 / 100 (Emerging)

When training large language models, this tool helps you check whether your pre-training data contains parts of your test datasets. It takes a pre-training dataset and one or more test datasets, then identifies the text sequences (N-grams) they share. This helps ensure your language model is evaluated on truly unseen data, giving you a more accurate picture of its performance.
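The core idea is simple set intersection over N-grams. Below is a minimal, illustrative sketch of that idea in plain Python; the function names and whitespace tokenization are assumptions for clarity, not overlapy's actual API.

```python
# Minimal sketch of the N-gram overlap idea behind overlapy
# (illustrative only -- not the package's actual API).

def ngrams(tokens, n):
    """Return the set of successive n-grams from a token list as tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_ngrams(pretrain_text, test_text, n=3):
    """Return n-grams appearing in both the pre-training and test text."""
    pretrain = ngrams(pretrain_text.lower().split(), n)
    test = ngrams(test_text.lower().split(), n)
    return pretrain & test

pretrain = "the quick brown fox jumps over the lazy dog"
test = "a quick brown fox ran away"
print(contaminated_ngrams(pretrain, test))  # {('quick', 'brown', 'fox')}
```

Any non-empty result flags test examples whose N-grams leak from the pre-training corpus; in practice you would tune `n` (larger values mean stricter, lower-recall matching) and use a proper tokenizer rather than `split()`.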

No commits in the last 6 months. Available on PyPI.

Use this if you are a machine learning researcher or engineer developing and evaluating large language models and need to ensure the integrity of your model's test results.

Not ideal if you need general document-similarity comparison or plagiarism detection for text analysis tasks outside of language model data contamination.

Tags: Language Model Training, NLP, Dataset Curation, Machine Learning Evaluation, Data Contamination Analysis, Natural Language Processing
Stale (6m) · No Dependents
Maintenance 0 / 25
Adoption 5 / 25
Maturity 25 / 25
Community 13 / 25


Stars: 10
Forks: 2
Language: Python
License: MIT
Last pushed: Sep 23, 2021
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/nlx-group/overlapy"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.