hplt-project/sacremoses

Python port of Moses tokenizer, truecaser and normalizer


Established

This tool prepares large amounts of text for natural language processing by tokenizing it (splitting sentences into individual words and symbols), truecasing it (restoring original capitalization), and normalizing its punctuation. It takes raw text or already tokenized words as input and outputs cleaned, consistently formatted text suitable for analysis or machine translation. It is aimed at people who work with text data, such as linguists, data scientists, and researchers in computational linguistics.
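To make these operations concrete, here is a minimal standard-library sketch of punctuation normalization, tokenization, and detokenization. It is a simplified illustration, not the sacremoses implementation: the helper names `normalize_punct`, `tokenize`, and `detokenize` are hypothetical, and the rules are far cruder than the Moses ones. The package itself exposes classes such as `MosesTokenizer`, `MosesDetokenizer`, `MosesTruecaser`, and `MosesPunctNormalizer` for real use.

```python
import re
from typing import List

# Hypothetical helpers for illustration only; NOT the sacremoses implementation.

# A few typographic characters mapped to ASCII equivalents (assumed subset).
_PUNCT_MAP = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2013": "-",  # en dash
}


def normalize_punct(text: str) -> str:
    """Replace typographic punctuation with ASCII and collapse whitespace."""
    for src, dst in _PUNCT_MAP.items():
        text = text.replace(src, dst)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> List[str]:
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def detokenize(tokens: List[str]) -> str:
    """Join tokens with spaces, then reattach common closing punctuation."""
    joined = " ".join(tokens)
    return re.sub(r"\s+([.,!?;:])", r"\1", joined)


if __name__ == "__main__":
    raw = "\u201cHello\u201d,   world!"
    clean = normalize_punct(raw)
    tokens = tokenize(clean)
    print(tokens)
    print(detokenize(tokens))
```

The sketch shows only the shape of the pipeline (normalize, tokenize, process, detokenize). Real tokenizers also handle language-specific cases such as abbreviations, contractions, and URLs, which this example ignores.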

495 stars. Used by 32 other packages. Available on PyPI.

Use this if you need to reliably clean, tokenize, detokenize, or truecase text data for consistent processing across different language models or analytical workflows.

Not ideal if you only need basic text manipulation like simple string replacement or if you're working with very small, non-linguistic datasets.

natural-language-processing computational-linguistics text-preparation machine-translation data-cleaning

Maintenance 10 / 25

Adoption 15 / 25

Maturity 25 / 25

Community 18 / 25


Stars: 495

Forks

Language: Python

License: MIT

Related tools

Blake-Madden/OleanderStemmingLibrary

Porter stemming library (C++)

adbar/simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

htaghizadeh/PersianStemmer-Python

PersianStemmer-Python

michmech/lemmatization-lists

Machine-readable lists of lemma-token pairs in 23 languages.

winkjs/wink-porter2-stemmer

JavaScript implementation of Porter Stemmer Algorithm V2 by Dr Martin F Porter
