michmech/lemmatization-lists

Machine-readable lists of lemma-token pairs in 23 languages.

Established

This project provides pre-compiled lists of lemma-token pairs in 23 languages, linking each base word (lemma) to its inflected forms (tokens). For example, a search for 'walk' can use these lists to also match 'walking' and 'walked'. This improves the recall of text search and other language-processing tasks on multilingual content.
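As a rough illustration of the lemma-to-forms link described above, the sketch below builds a lookup table from lemma-token pairs and uses it to expand a search term. It assumes a tab-separated layout with the lemma first and the token second; that layout, the sample rows, and the names `SAMPLE`, `load_forms`, and `expand` are hypothetical and should be checked against the actual list files.

```python
from collections import defaultdict

# Hypothetical sample in an assumed "lemma<TAB>token" layout.
# These rows are illustrative, not copied from the project's files.
SAMPLE = "walk\twalking\nwalk\twalked\nwalk\twalks\ngo\twent\n"


def load_forms(text):
    """Map each lemma to the set of its forms, including the lemma itself."""
    forms = defaultdict(set)
    for line in text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        lemma, token = parts
        forms[lemma].add(lemma)
        forms[lemma].add(token)
    return dict(forms)


def expand(term, forms):
    """Return every known form of a lemma, or the term alone if it is unknown."""
    return sorted(forms.get(term, {term}))


FORMS = load_forms(SAMPLE)
print(expand("walk", FORMS))  # ['walk', 'walked', 'walking', 'walks']
```

Terms that do not appear in the list fall back to themselves, so the expansion never returns fewer matches than a plain search would.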

361 stars. No commits in the last 6 months.

Use this if you need to improve the recall of your search engine or text analysis by finding all grammatical variations of a word across multiple languages.
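One way this recall improvement might look in practice is to reduce both the query and the document words to their lemmas before comparing them. The following is a minimal sketch under that assumption; the pairs, the documents, and the names `PAIRS`, `lemmatize`, and `search` are hypothetical, and a token that belongs to more than one lemma would simply keep the last lemma seen.

```python
# Hypothetical (lemma, token) pairs; illustrative only, not taken from the lists.
PAIRS = [("walk", "walking"), ("walk", "walked"), ("run", "ran"), ("run", "running")]

# Reverse lookup: inflected form -> lemma.
TOKEN_TO_LEMMA = {token: lemma for lemma, token in PAIRS}


def lemmatize(word):
    """Return the lemma for a known form, or the lowercased word itself."""
    w = word.lower()
    return TOKEN_TO_LEMMA.get(w, w)


def search(query, documents):
    """Return documents containing any form that shares the query's lemma."""
    target = lemmatize(query)
    return [
        doc for doc in documents
        if target in {lemmatize(w) for w in doc.split()}
    ]


DOCS = ["She walked home", "They ran fast", "A quiet walk"]
print(search("walking", DOCS))  # ['She walked home', 'A quiet walk']
```

A plain string match for 'walking' would find none of these documents; matching on lemmas finds two.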

Not ideal if you require real-time lemmatization for highly dynamic or custom text inputs, or if you need to generate new lemma-token pairs rather than use existing lists.

multilingual-search text-processing information-retrieval natural-language-processing digital-humanities

Stale 6m No Package No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 24 / 25

Stars

361

Language

—

License

ODbL-1.0

Related tools

hplt-project/sacremoses

Python port of Moses tokenizer, truecaser and normalizer

Blake-Madden/OleanderStemmingLibrary

Porter stemming library (C++)

adbar/simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

htaghizadeh/PersianStemmer-Python

PersianStemmer-Python

winkjs/wink-porter2-stemmer

Javascript Implementation of Porter Stemmer Algorithm V2 by Dr Martin F Porter
