nlx-group/overlapy

A Python package for measuring textual overlap (shared N-grams) between two bodies of text.

Score: 43 / 100 (Emerging)

When training large language models, this tool helps you check whether your pre-training data contains parts of your test datasets. It takes a pre-training dataset and one or more test datasets, then identifies the text sequences (N-grams) they share. This helps ensure your language model is evaluated on truly unseen data, giving you a more accurate picture of its performance.
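The core idea is simple set intersection over N-grams. Below is a minimal, illustrative sketch of that idea in plain Python; the function names and whitespace tokenization are assumptions for clarity, not overlapy's actual API.

```python
# Minimal sketch of the N-gram overlap idea behind overlapy
# (illustrative only -- not the package's actual API).

def ngrams(tokens, n):
    """Return the set of successive n-grams from a token list as tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_ngrams(pretrain_text, test_text, n=3):
    """Return n-grams appearing in both the pre-training and test text."""
    pretrain = ngrams(pretrain_text.lower().split(), n)
    test = ngrams(test_text.lower().split(), n)
    return pretrain & test

pretrain = "the quick brown fox jumps over the lazy dog"
test = "a quick brown fox ran away"
print(contaminated_ngrams(pretrain, test))  # {('quick', 'brown', 'fox')}
```

Any non-empty result flags test examples whose N-grams leak from the pre-training corpus; in practice you would tune `n` (larger values mean stricter, lower-recall matching) and use a proper tokenizer rather than `split()`.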

No commits in the last 6 months. Available on PyPI.

Use this if you are a machine learning researcher or engineer developing and evaluating large language models and need to ensure the integrity of your model's test results.

Not ideal if you need general document-similarity comparison or plagiarism detection for text analysis tasks outside of language model data contamination.

Tags: Language Model Training, NLP, Dataset Curation, Machine Learning Evaluation, Data Contamination Analysis, Natural Language Processing
Stale (6m) · No Dependents
Maintenance 0 / 25
Adoption 5 / 25
Maturity 25 / 25
Community 13 / 25


Stars: 10
Forks: 2
Language: Python
License: MIT
Last pushed: Sep 23, 2021
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/nlx-group/overlapy"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.