nlx-group/overlapy
A Python package for evaluating textual overlap (shared N-Grams) between two bodies of text.
When training large language models, this tool helps evaluate if your pre-training data contains parts of your test datasets. It takes in a pre-training dataset and one or more test datasets, then identifies shared text sequences (N-Grams). This ensures your language model is tested on truly unseen data, giving you a more accurate evaluation of its performance.
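The check described above can be illustrated with a minimal sketch. This is not overlapy's own API: the function names `ngrams` and `find_overlaps`, the lowercase whitespace tokenization, and the choice of n=3 are all assumptions made for illustration, and the package itself may tokenize and match differently.

```python
def ngrams(tokens, n):
    """Yield every contiguous window of n tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def find_overlaps(pretraining_docs, test_sets, n=3):
    """Return, per test set, the n-grams that also occur in the pre-training text.

    Hypothetical helper: tokenization is a naive lowercase whitespace split.
    """
    # Index every n-gram of the pre-training data once.
    seen = set()
    for doc in pretraining_docs:
        seen.update(ngrams(doc.lower().split(), n))

    # Look up each test example's n-grams in that index.
    overlaps = {}
    for name, examples in test_sets.items():
        shared = set()
        for example in examples:
            shared.update(g for g in ngrams(example.lower().split(), n) if g in seen)
        overlaps[name] = shared
    return overlaps


pretraining = ["the quick brown fox jumps over the lazy dog"]
tests = {"toy_benchmark": ["a quick brown fox appeared", "nothing shared here at all"]}
result = find_overlaps(pretraining, tests, n=3)
print(result)
```

A non-empty set for a test dataset signals that some of its text also appears in the pre-training data, so scores on that dataset may overstate performance on unseen data.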
No commits in the last 6 months. Available on PyPI.
Use this if you are a machine learning researcher or engineer developing and evaluating large language models and need to ensure the integrity of your model's test results.
Not ideal if you need to compare document similarity or plagiarism for general text analysis tasks outside of language model data contamination.
Stars: 10
Forks: 2
Language: Python
License: MIT
Category:
Last pushed: Sep 23, 2021
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/nlx-group/overlapy"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
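The same request can be made from Python with the standard library. This is a sketch under stated assumptions: the helper names `build_quality_url` and `fetch_quality` are hypothetical, reading the URL path as category/owner/repository is a guess based on the curl example, and the response format is not documented here, so the body is returned as raw text.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example; the segment meaning is an assumption.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Build the endpoint URL (hypothetical helper, path layout assumed)."""
    parts = [quote(p, safe="") for p in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch the endpoint and return the response body as text."""
    with urlopen(build_quality_url(category, owner, repo), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Performs a real network request against the keyless, rate-limited tier.
    print(fetch_quality("nlp", "nlx-group", "overlapy"))
```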
Related tools
joshualoehr/ngram-language-model
Python implementation of an N-gram language model with Laplace smoothing and sentence generation.
MannarAmuthan/kural-gen
KuralGen generates Thirukkural for a given English sentence
phughesmcr/SimpleNGrams
The easiest way to get n-grams from strings!
SpydazWebAI-NLP/BasicLanguageModelling2023
Basic language models, bag of words, N-gram models, etc.: NLP modelling and associated tasks
simrann20/Hangman_Game_Project
Hangman game implementation using an n-gram language model, achieving an accuracy of more than 50%