proycon/python-ucto
This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).
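To see what "regular-expression based" tokenisation means in principle, here is a deliberately naive stdlib sketch. This is not ucto's rule set: ucto ships rich, language-specific rules for abbreviations, URLs, clitics, and more, while this toy splitter only separates word characters from punctuation.

```python
import re

# Toy illustration of regular-expression-based tokenisation.
# NOT ucto: real rule sets handle abbreviations ("Mr."),
# contractions ("didn't"), URLs, etc., which this mis-splits.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def toy_tokenize(text: str) -> list[str]:
    """Return runs of word characters and individual punctuation marks."""
    return TOKEN_RE.findall(text)

print(toy_tokenize("Mr. Dursley didn't notice."))
# → ['Mr', '.', 'Dursley', 'didn', "'", 't', 'notice', '.']
```

Note how "Mr." and "didn't" are split incorrectly; cases like these are exactly why a rule-based tokeniser such as ucto is not trivial to replace with a one-line regex.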
This project helps Natural Language Processing (NLP) practitioners break down raw text into individual words or punctuation marks. It takes plain text documents or FoLiA XML as input and produces a sequence of tokens (words, numbers, punctuation) as output, which can then be used for further linguistic analysis. Anyone working on text analysis, linguistic research, or building text-based applications would find this useful.
Available on PyPI.
Use this if you need to precisely separate text into its constituent words and punctuation, especially for languages with complex tokenization rules, as a first step in NLP workflows.
Not ideal if you primarily work on Windows without WSL/Docker or require a simple, out-of-the-box solution for very basic English text splitting without advanced linguistic considerations.
Stars: 31
Forks: 5
Language: Cython
License: —
Category: —
Last pushed: Feb 02, 2026
Commits (30d): 0
Dependencies: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/proycon/python-ucto"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Compare
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer