proycon/python-ucto
This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).
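To see what "regular-expression based" tokenisation means in principle, here is a deliberately naive stdlib sketch. This is not ucto's rule set: ucto ships rich, language-specific rules for abbreviations, URLs, clitics, and more, while this toy splitter only separates word characters from punctuation.

```python
import re

# Toy illustration of regular-expression-based tokenisation.
# NOT ucto: real rule sets handle abbreviations ("Mr."),
# contractions ("didn't"), URLs, etc., which this mis-splits.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def toy_tokenize(text: str) -> list[str]:
    """Return runs of word characters and individual punctuation marks."""
    return TOKEN_RE.findall(text)

print(toy_tokenize("Mr. Dursley didn't notice."))
# → ['Mr', '.', 'Dursley', 'didn', "'", 't', 'notice', '.']
```

Note how "Mr." and "didn't" are split incorrectly; cases like these are exactly why a rule-based tokeniser such as ucto is not trivial to replace with a one-line regex.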
This project helps Natural Language Processing (NLP) practitioners break down raw text into individual words or punctuation marks. It takes plain text documents or FoLiA XML as input and produces a sequence of tokens (words, numbers, punctuation) as output, which can then be used for further linguistic analysis. Anyone working on text analysis, linguistic research, or building text-based applications would find this useful.
Available on PyPI.
Use this if you need to precisely separate text into its constituent words and punctuation, especially for languages with complex tokenization rules, as a first step in NLP workflows.
Not ideal if you primarily work on Windows without WSL/Docker or require a simple, out-of-the-box solution for very basic English text splitting without advanced linguistic considerations.
Stars: 31
Forks: 5
Language: Cython
License: —
Category: —
Last pushed: Feb 02, 2026
Commits (30d): 0
Dependencies: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/proycon/python-ucto"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Compare
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer