master/spark-stemming

Spark MLlib wrapper for the Snowball framework

Emerging

This tool helps data engineers and data scientists normalize text by reducing words to their stems in any language that the Snowball framework supports. It takes raw text, often as one stage of a larger Spark data processing workflow, and outputs a version in which inflected forms such as "running" and "runs" both become "run." Because Snowball stemmers strip suffixes by rule rather than look words up in a dictionary, irregular forms such as "ran" are generally left unchanged. Stemming of this kind is especially useful for anyone building search engines, recommendation systems, or sentiment analysis tools.
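To make the idea of rule-based stemming concrete, here is a minimal sketch in plain Python. It is not the Snowball algorithm and not this project's API. The function names `toy_stem` and `stem_text` and the three suffix rules are assumptions chosen for illustration, and a real Snowball stemmer applies many more rules per language.

```python
# Toy suffix-stripping stemmer. Illustrative only: NOT Snowball, NOT spark-stemming.
# The suffix list and helper names below are hypothetical.
SUFFIXES = ("ing", "es", "s")  # tried in this order


def toy_stem(word):
    """Lowercase a word and strip at most one common English suffix."""
    w = word.lower()
    if w.endswith("ss"):
        # Leave words like "glass" alone instead of cutting the final "s".
        return w
    for suffix in SUFFIXES:
        # Require at least three characters to remain after stripping.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            stem = w[: -len(suffix)]
            # After removing "ing", undo a doubled consonant: "runn" -> "run".
            # Doubled "l" and "s" are kept, so "calling" -> "call".
            if (
                suffix == "ing"
                and len(stem) >= 2
                and stem[-1] == stem[-2]
                and stem[-1] not in "aeiouls"
            ):
                stem = stem[:-1]
            return stem
    return w


def stem_text(text):
    """Split on whitespace and stem each token."""
    return [toy_stem(token) for token in text.split()]


if __name__ == "__main__":
    print(stem_text("Running runs ran"))  # ['run', 'run', 'ran']
```

The irregular form "ran" passes through untouched because no suffix rule matches it. Mapping it to "run" would require a dictionary-based lemmatizer, which is a different technique.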

No commits in the last 6 months.

Use this if you are processing large volumes of text data in Apache Spark and need to standardize words to their base form to improve analysis or search accuracy.

Not ideal if you are working with text in a language not supported by the Snowball framework or if you don't use Apache Spark for your data processing.

information-retrieval natural-language-processing text-analytics data-preprocessing search-engine-optimization

Stale 6m No Package No Dependents

Maintenance 0 / 25

Adoption 7 / 25

Maturity 16 / 25

Community 19 / 25

Language

Java

License

BSD-2-Clause

Higher-rated alternatives

hplt-project/sacremoses

Python port of Moses tokenizer, truecaser and normalizer

Blake-Madden/OleanderStemmingLibrary

Porter stemming library (C++)

adbar/simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

htaghizadeh/PersianStemmer-Python

PersianStemmer-Python

michmech/lemmatization-lists

Machine-readable lists of lemma-token pairs in 23 languages.