hplt-project/sacremoses
Python port of Moses tokenizer, truecaser and normalizer
sacremoses prepares large amounts of text for natural language processing by splitting sentences into tokens (individual words and symbols), restoring natural capitalization (truecasing), and normalizing punctuation. It takes raw or already tokenized text as input and outputs cleaned, consistently formatted text suitable for analysis or machine translation. It is useful to anyone working with text data, such as linguists, data scientists, and researchers in computational linguistics.
495 stars. Used by 32 other packages. Available on PyPI.
Use this if you need to reliably clean, tokenize, detokenize, or truecase text data for consistent processing across different language models or analytical workflows.
Not ideal if you only need basic text manipulation like simple string replacement or if you're working with very small, non-linguistic datasets.
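The normalize, tokenize, and detokenize steps described above might look like the following minimal sketch. It assumes the package is installed (`pip install sacremoses`) and uses its `MosesPunctNormalizer`, `MosesTokenizer`, and `MosesDetokenizer` classes; the helper names `normalize_and_tokenize` and `detokenize` are hypothetical wrappers written for this illustration, not part of the library. Truecasing is left out because it requires a trained model.

```python
def normalize_and_tokenize(text, lang="en"):
    """Normalize punctuation, then split the text into a list of tokens.

    Hypothetical helper for illustration; the import is done lazily so this
    module still loads when sacremoses is not installed.
    """
    from sacremoses import MosesPunctNormalizer, MosesTokenizer

    normalizer = MosesPunctNormalizer(lang=lang)
    tokenizer = MosesTokenizer(lang=lang)
    normalized = normalizer.normalize(text)
    # escape=False keeps characters such as apostrophes as-is instead of
    # converting them to XML entities (the Moses default behaviour).
    return tokenizer.tokenize(normalized, escape=False)


def detokenize(tokens, lang="en"):
    """Join a list of tokens back into a single readable string."""
    from sacremoses import MosesDetokenizer

    return MosesDetokenizer(lang=lang).detokenize(tokens)
```

Calling `normalize_and_tokenize("Hello, world!")` returns a list of tokens with punctuation split off, and passing that list to `detokenize` reassembles a sentence. Other languages can be selected through the `lang` argument.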
Stars: 495
Forks: 60
Language: Python
License: MIT
Category: (not shown)
Last pushed: Feb 06, 2026
Commits (30d): 0
Dependencies: 4
Reverse dependents: 32
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/hplt-project/sacremoses"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
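For scripted access, the same request could be made from Python using only the standard library. This is a sketch under assumptions: the `category/owner/repo` URL pattern is inferred from the single curl example above, the helper names `quality_url` and `fetch_quality` are hypothetical, and because the response format is not documented here the body is returned as plain text rather than parsed. How an API key is supplied is not stated, so the sketch makes keyless requests only.

```python
import urllib.parse
import urllib.request

# Base path taken from the curl example; the path layout beyond that one
# URL is an assumption.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL for one repository (hypothetical helper)."""
    parts = [urllib.parse.quote(p, safe="") for p in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch the raw response body as text; raises URLError on failure."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_quality("nlp", "hplt-project", "sacremoses")` requests the same URL as the curl command. With the keyless limit of 100 requests/day, it is worth caching the result rather than fetching it repeatedly.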
Related tools
Blake-Madden/OleanderStemmingLibrary
Porter stemming library (C++)
adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
htaghizadeh/PersianStemmer-Python
PersianStemmer-Python
michmech/lemmatization-lists
Machine-readable lists of lemma-token pairs in 23 languages.
winkjs/wink-porter2-stemmer
JavaScript implementation of the Porter Stemmer Algorithm V2 by Dr. Martin F. Porter