ilinguistics/corpus_similarity

Measure the similarity of text corpora for 74 languages

Emerging

This tool helps linguists, researchers, and anyone working with large text collections quickly gauge how similar two bodies of text are. You provide two text datasets, and it returns a single score between 0 and 1 indicating how similar they are. This is useful for analyzing language use across contexts or time periods in any of the 74 supported languages.

No commits in the last 6 months.

Use this if you need to objectively compare the linguistic content of two large text collections (at least 10,000 words each) in one of 74 supported languages.

Not ideal if you need to compare very short texts or texts in different languages, as the scores are only consistent within a single language.
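
To make the idea of a single 0-to-1 corpus score concrete, here is a minimal, self-contained sketch. It is not the corpus_similarity algorithm or its API: it substitutes plain cosine similarity over word counts, and the names `cosine_similarity`, `word_counts`, and `MIN_WORDS` are hypothetical, chosen only for illustration. The 10,000-word floor mirrors the size guideline stated above.

```python
import math
from collections import Counter

MIN_WORDS = 10_000  # size guideline from the description; hypothetical constant name


def word_counts(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())


def cosine_similarity(corpus_a, corpus_b, min_words=MIN_WORDS):
    """Return a score in [0, 1]; higher means more similar word usage.

    Illustrative stand-in only: NOT the method used by corpus_similarity.
    """
    counts_a, counts_b = word_counts(corpus_a), word_counts(corpus_b)

    # Reject corpora that are too small for a stable comparison.
    for name, counts in (("corpus_a", counts_a), ("corpus_b", counts_b)):
        total = sum(counts.values())
        if total < min_words:
            raise ValueError(f"{name} has {total} words; need at least {min_words}")

    # Dot product over the shared vocabulary, then normalise by vector lengths.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    # Clamp to guard against floating-point overshoot just above 1.0.
    return min(1.0, dot / (norm_a * norm_b))


# Two toy corpora of 12,000 words each, both in the same language.
a = "the cat sat on the mat " * 2000
b = "the dog sat on the log " * 2000
score = cosine_similarity(a, b)
```

As with the tool itself, a score from a sketch like this is only meaningful when both corpora are in the same language, since word-frequency profiles are not comparable across languages.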

linguistic-analysis corpus-comparison text-research computational-linguistics language-variation

Stale 6m No Package No Dependents

Maintenance 0 / 25

Adoption 5 / 25

Maturity 16 / 25

Community 14 / 25

Stars

Forks

Language

Python

License

GPL-3.0

Higher-rated alternatives

Helsinki-NLP/OpusFilter

OpusFilter - Parallel corpus processing toolkit

natasha/corus

Links to Russian corpora + Python functions for loading and parsing

darija-open-dataset/dataset

darija <-> english dataset

omicsNLP/Auto-CORPus

Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...

SergeyShk/ruTS

A library for extracting statistics from Russian-language texts.
