ilinguistics/corpus_similarity

Measure the similarity of text corpora for 74 languages

Score: 35 / 100 (Emerging)

This tool helps linguists, researchers, and anyone working with large text collections gauge how similar two bodies of text are. You provide two text datasets, and it returns a single number between 0 and 1 that summarizes their similarity. This is useful for comparing language use across contexts or time periods in any of the 74 supported languages.

No commits in the last 6 months.

Use this if you need to objectively compare the linguistic content of two large text collections (at least 10,000 words each) in one of 74 supported languages.

Not ideal if you need to compare very short texts or texts in different languages, as the scores are only consistent within a single language.
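To make the idea of a single 0-to-1 score concrete, here is a minimal sketch of one generic way to compare two corpora: cosine similarity over word-frequency counts. This is an illustration only and is not taken from corpus_similarity itself; the function names (word_counts, cosine_similarity) and the MIN_WORDS constant are hypothetical, and the tool may compute its score in a different way. In this sketch, a score near 1 means the two texts use words in very similar proportions.

```python
import math
import re
from collections import Counter

# Assumption: mirrors the 10,000-word guidance above; not a value read from the tool.
MIN_WORDS = 10_000


def word_counts(text):
    """Count lowercase word tokens in a text (hypothetical helper)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b, min_words=MIN_WORDS):
    """Return a score in [0, 1]: cosine of the two word-frequency vectors.

    Illustrative stand-in only, not the method used by corpus_similarity.
    """
    counts_a = word_counts(text_a)
    counts_b = word_counts(text_b)

    # Very short corpora give unstable frequency estimates, so reject them.
    if sum(counts_a.values()) < min_words or sum(counts_b.values()) < min_words:
        raise ValueError(f"each corpus needs at least {min_words} words")

    # Dot product over the words the two corpora share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the counts depend on a language's own vocabulary, a score like this is only meaningful when both corpora are in the same language, which matches the caveat above.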

linguistic-analysis corpus-comparison text-research computational-linguistics language-variation
Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 5 / 25
Maturity 16 / 25
Community 14 / 25


Stars: 14
Forks: 3
Language: Python
License: GPL-3.0
Last pushed: Jan 26, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/ilinguistics/corpus_similarity"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
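The same request can be made from Python using only the standard library. This is a rough sketch under two assumptions that the page does not confirm: that the endpoint returns JSON, and that the URL follows a category/owner/repo pattern (inferred from the single curl URL above). The helper names build_quality_url and fetch_quality are hypothetical.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Build the endpoint URL; the category/owner/repo layout is an assumption."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch and decode the response, assuming the body is JSON."""
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_quality("nlp", "ilinguistics", "corpus_similarity") requests the same URL as the curl command. Keyless access is limited to 100 requests/day, so cache the result rather than calling it in a loop.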