ilinguistics/corpus_similarity
Measure the similarity of text corpora for 74 languages
This tool helps linguists, researchers, and anyone working with large text collections quickly gauge how similar two bodies of text are. You provide two text datasets, and it returns a single number between 0 and 1 indicating how similar they are. This is useful for analyzing language use across different contexts or time periods in any of the 74 supported languages.
No commits in the last 6 months.
Use this if you need to objectively compare the linguistic content of two large text collections (at least 10,000 words each) in one of 74 supported languages.
Not ideal if you need to compare very short texts or texts written in different languages, as the scores are only consistent within a single language.
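To make the idea of a single corpus-level score concrete, here is a minimal sketch that compares two texts by the cosine similarity of their word-frequency counts. This is a stand-in technique chosen for illustration, not the method or API of corpus_similarity itself; the function names `word_counts` and `cosine_similarity` and the crude regex tokenizer are assumptions. Because word counts are never negative, the score falls between 0 and 1.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Lowercase the text and count word tokens (a crude, illustrative tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(corpus_a: str, corpus_b: str) -> float:
    """Cosine similarity of word-frequency vectors: 0 = no shared words, 1 = same distribution.

    Illustrative only; this is NOT how corpus_similarity computes its score.
    """
    counts_a = word_counts(corpus_a)
    counts_b = word_counts(corpus_b)
    # Only words present in both corpora contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty corpus shares nothing with anything
    return dot / (norm_a * norm_b)


# Toy inputs; real comparisons need far larger corpora (the listing suggests 10,000+ words each).
news = "the council voted on the budget and the mayor signed the budget"
chat = "lol the budget thing is wild did you see the mayor"
print(cosine_similarity(news, chat))
```

Short inputs like the ones above give unstable scores, which is consistent with the guidance to use large collections in a single language.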
Stars: 14
Forks: 3
Language: Python
License: GPL-3.0
Category:
Last pushed: Jan 26, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/ilinguistics/corpus_similarity"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
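The same request can be issued from Python with only the standard library. This is a rough sketch under stated assumptions: the helper names `build_url` and `fetch_quality` are hypothetical, the `/<domain>/<owner>/<repo>` pattern is inferred from the single URL shown above, and the response is assumed to be JSON, which the listing does not confirm.

```python
import json
import urllib.request

# Base path taken from the curl example; everything after it is appended per project.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(owner: str, repo: str, domain: str = "nlp") -> str:
    """Assemble the endpoint URL (path pattern inferred from one example, so an assumption)."""
    return f"{BASE_URL}/{domain}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the listing data without an API key.

    Assumes the endpoint returns a JSON body; adjust the parsing if it does not.
    """
    with urllib.request.urlopen(build_url(owner, repo), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("ilinguistics", "corpus_similarity")` would hit the same URL as the curl command, and each call counts against the keyless daily limit.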
Higher-rated alternatives
Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
natasha/corus
Links to Russian corpora + Python functions for loading and parsing
darija-open-dataset/dataset
darija <-> english dataset
omicsNLP/Auto-CORPus
Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...
SergeyShk/ruTS
A library for extracting statistics from Russian-language texts.