proycon/python-ucto

This is a Python binding to the tokeniser Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial as it appears. This binding makes the power of the Ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).

Score: 47 / 100 (Emerging)

This project helps Natural Language Processing (NLP) practitioners break down raw text into individual words or punctuation marks. It takes plain text documents or FoLiA XML as input and produces a sequence of tokens (words, numbers, punctuation) as output, which can then be used for further linguistic analysis. Anyone working on text analysis, linguistic research, or building text-based applications would find this useful.
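As a rough illustration of what regular-expression-based tokenisation produces, the sketch below splits a sentence into words, numbers, and punctuation using only Python's standard `re` module. It does not use python-ucto or Ucto's rule files; the pattern and the `tokenize` helper are simplified assumptions made up for this example.

```python
import re

# Hypothetical, simplified pattern for illustration only.
# This is NOT Ucto's actual rule set.
TOKEN_PATTERN = re.compile(
    r"""
    \d+(?:[.,]\d+)*       # numbers, including forms like 1,000 or 3.14
    | \w+(?:['-]\w+)*     # words, keeping internal apostrophes and hyphens
    | [^\w\s]             # any single punctuation mark or symbol
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens found in text, in order of appearance."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Mr. Smith paid 1,000 euros, didn't he?")
print(tokens)
```

Note that this naive pattern separates "Mr" from its full stop. Handling abbreviations, sentence boundaries, and language-specific conventions is the kind of case where a dedicated, configurable tokeniser is a better fit than a single regular expression.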

Available on PyPI.

Use this if you need to precisely separate text into its constituent words and punctuation, especially for languages with complex tokenization rules, as a first step in NLP workflows.

Not ideal if you primarily work on Windows without WSL/Docker, or if you only need simple, out-of-the-box splitting of basic English text without advanced linguistic considerations.

natural-language-processing text-analysis computational-linguistics linguistic-research data-preparation
No License
Maintenance 10 / 25
Adoption 7 / 25
Maturity 17 / 25
Community 13 / 25

Stars: 31
Forks: 5
Language: Cython
License: none specified
Last pushed: Feb 02, 2026
Commits (30d): 0
Dependencies: 1

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/proycon/python-ucto"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
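The same request can be sketched in Python with only the standard library. The helper names `quality_url` and `fetch_quality` are hypothetical. The sketch assumes that the path follows a category/owner/repo pattern, inferred from the single URL above, and that the endpoint returns JSON. Neither assumption is confirmed here.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL (assumed pattern: category/owner/repo)."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch the quality record; assumes the response body is JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (performs a network request, so it is left commented out):
# data = fetch_quality("nlp", "proycon", "python-ucto")
# print(data)
```

The keyless tier is limited to 100 requests/day, so it may be worth caching the result locally rather than calling `fetch_quality` repeatedly.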