hplt-project/sacremoses

Python port of Moses tokenizer, truecaser and normalizer

68 / 100 Established

This tool prepares large amounts of text for natural language processing by splitting sentences into individual words and symbols (tokenization), restoring capitalization (truecasing), and standardizing punctuation (normalization). It takes raw text or already tokenized words as input and outputs cleaned, consistently formatted text suitable for analysis or machine translation. It is aimed at people who work with text data, such as linguists, data scientists, and researchers in computational linguistics.

495 stars. Used by 32 other packages. Available on PyPI.

Use this if you need to reliably clean, tokenize, detokenize, or truecase text data for consistent processing across different language models or analytical workflows.

Not ideal if you only need basic text manipulation like simple string replacement or if you're working with very small, non-linguistic datasets.

natural-language-processing computational-linguistics text-preparation machine-translation data-cleaning
Maintenance 10 / 25
Adoption 15 / 25
Maturity 25 / 25
Community 18 / 25


Stars: 495
Forks: 60
Language: Python
License: MIT
Last pushed: Feb 06, 2026
Commits (30d): 0
Dependencies: 4
Reverse dependents: 32

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/hplt-project/sacremoses"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
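
The same request can be made from Python with only the standard library. This is a sketch under assumptions: the helper names `quality_url` and `fetch_quality` are hypothetical, the path segments are read as category, owner, and repository based on the URL above, and the response is assumed to be JSON (its fields are not documented here, so none are accessed). How an API key is supplied is not stated on this page, so the sketch covers only keyless access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (hypothetical helper, not part of any official client)."""
    parts = (category, owner, repo)
    # Percent-encode each segment so unusual characters cannot break the path.
    return BASE_URL + "/" + "/".join(quote(part, safe="") for part in parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality data and decode it, assuming the body is JSON."""
    with urlopen(quality_url(category, owner, repo), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_quality("nlp", "hplt-project", "sacremoses")` would request the URL shown in the curl command. With the keyless limit of 100 requests per day, it may be worth caching the result locally rather than fetching on every run.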