master/spark-stemming
Spark MLlib wrapper for the Snowball framework
This tool helps data engineers and data scientists normalize text data by reducing words to their stems across the languages Snowball supports. It takes in raw text, often as part of a larger data processing workflow, and outputs a version where inflected words like "running" and "runs" both become "run." Irregular forms such as "ran" are left unchanged, because a stemmer strips suffixes by rule rather than looking words up in a dictionary. This is especially useful for anyone building search engines, recommendation systems, or sentiment analysis tools.
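To make the idea of stemming concrete, here is a minimal sketch in plain Python. It is a toy suffix stripper written for illustration only: it is not the Snowball algorithm and not this project's API, and the function name `toy_stem` is hypothetical. It shows how rule-based suffix removal collapses regular inflections while leaving an irregular form such as "ran" alone.

```python
def toy_stem(word: str) -> str:
    """Toy rule-based stemmer (illustration only, NOT Snowball)."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Skip plural stripping for words ending in "ss" (e.g. "class").
        if suffix == "s" and w.endswith("ss"):
            continue
        # Only strip when at least three characters would remain.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            # After "-ing"/"-ed", undo a doubled final consonant: "runn" -> "run".
            if suffix in ("ing", "ed") and w[-1] == w[-2] and w[-1] not in "aeiou":
                w = w[:-1]
            break
    return w


for token in ("running", "runs", "ran"):
    print(token, "->", toy_stem(token))
# running -> run
# runs -> run
# ran -> ran
```

A real Snowball stemmer applies a much larger, language-specific rule set, so its output will differ from this sketch on many words.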
No commits in the last 6 months.
Use this if you are processing large volumes of text data in Apache Spark and need to standardize words to their base form to improve analysis or search accuracy.
Not ideal if you are working with text in a language not supported by the Snowball framework or if you don't use Apache Spark for your data processing.
Stars: 34
Forks: 20
Language: Java
License: BSD-2-Clause
Category
Last pushed: Nov 27, 2018
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/master/spark-stemming"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
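The same request can be made from Python with only the standard library. This is a sketch under stated assumptions: the helper names `quality_url` and `fetch_quality` are hypothetical, the reading of the path as category/owner/repository is inferred from the single URL shown here, and the response is assumed to be JSON, which this page does not confirm.

```python
import json
import urllib.request

# Base path taken from the curl example on this page.
BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (segment meaning is an assumption)."""
    return f"{BASE}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record; assumes the body is JSON (not confirmed here)."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (performs a network request and counts against the daily limit):
# data = fetch_quality("nlp", "master", "spark-stemming")
```

Keeping URL construction separate from the network call makes it possible to check the URL without spending one of the daily requests.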
Higher-rated alternatives
hplt-project/sacremoses
Python port of Moses tokenizer, truecaser and normalizer
Blake-Madden/OleanderStemmingLibrary
Porter stemming library (C++)
adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
htaghizadeh/PersianStemmer-Python
PersianStemmer-Python
michmech/lemmatization-lists
Machine-readable lists of lemma-token pairs in 23 languages.