michmech/lemmatization-lists
Machine-readable lists of lemma-token pairs in 23 languages.
This project provides pre-compiled word lists in 23 languages, each pairing a base word (lemma) with its inflected forms (tokens). For example, if you search for 'walk', these lists let your search also match 'walking' and 'walked'. This improves the recall of text searches and other language-processing tasks over multilingual content.
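The query expansion described above can be sketched as follows. This is a minimal illustration, not the project's own tooling: it assumes each list is a plain-text file with one tab-separated `lemma<TAB>token` pair per line, and the names `load_forms` and `expand_query` are hypothetical. The sample data is inlined so the sketch runs without downloading anything.

```python
import io
from collections import defaultdict


def load_forms(lines):
    """Build a lemma -> set-of-forms mapping.

    Assumes each line holds 'lemma<TAB>token'; blank or malformed
    lines are skipped.
    """
    forms = defaultdict(set)
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue
        lemma, token = parts
        forms[lemma].add(token)
    return forms


def expand_query(term, forms):
    """Return the search term together with every listed form of it."""
    return {term} | forms.get(term, set())


# Inline sample standing in for a real list file (assumed layout).
sample = io.StringIO("walk\twalked\nwalk\twalking\nwalk\twalks\n")
forms = load_forms(sample)
print(sorted(expand_query("walk", forms)))
# → ['walk', 'walked', 'walking', 'walks']
```

To use a downloaded list instead of the inline sample, pass an open file object to `load_forms`. If the file starts with a byte-order mark, opening it with `encoding="utf-8-sig"` strips the mark so it does not end up glued to the first lemma.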
361 stars. No commits in the last 6 months.
Use this if you need to improve the recall of your search engine or text analysis by finding all grammatical variations of a word across multiple languages.
Not ideal if you require real-time lemmatization for highly dynamic or custom text inputs, or if you need to generate new lemma-token pairs rather than use existing lists.
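The limitation above shows up as soon as a lookup meets a word the list does not contain. The sketch below inverts the pairs into a token-to-lemma dictionary and falls back to the unchanged token on a miss. It again assumes a tab-separated `lemma<TAB>token` layout; `load_lemmas` and `lemmatize` are hypothetical names, and the sample lines are made up for illustration.

```python
def load_lemmas(lines):
    """Build a token -> lemma mapping from 'lemma<TAB>token' lines.

    If a token is listed under more than one lemma, the first one wins;
    a static list carries no context to disambiguate.
    """
    lemmas = {}
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) == 2:
            lemma, token = parts
            lemmas.setdefault(token, lemma)
    return lemmas


def lemmatize(token, lemmas):
    """Look the token up; return it unchanged when the list has no entry."""
    return lemmas.get(token, token)


# Illustrative sample lines (assumed layout, not copied from the project).
sample_lines = ["walk\twalked\n", "walk\twalking\n", "go\twent\n"]
lemmas = load_lemmas(sample_lines)

print(lemmatize("went", lemmas))           # listed form
print(lemmatize("doomscrolling", lemmas))  # unlisted: comes back unchanged
```

A dictionary lookup like this is fast, but it cannot handle new coinages, misspellings, or domain-specific vocabulary, which is where a rule-based or statistical lemmatizer would be needed.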
Stars: 361
Forks: 98
Language: n/a
License: ODbL-1.0
Category:
Last pushed: Jan 29, 2022
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/michmech/lemmatization-lists"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
Related tools
hplt-project/sacremoses
Python port of Moses tokenizer, truecaser and normalizer
Blake-Madden/OleanderStemmingLibrary
Porter stemming library (C++)
adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
htaghizadeh/PersianStemmer-Python
PersianStemmer-Python
winkjs/wink-porter2-stemmer
JavaScript implementation of the Porter Stemmer Algorithm V2 by Dr Martin F Porter