ucto and python-ucto

python-ucto is a Python binding to ucto, making the two ecosystem siblings: the Python package serves as a client library for the core C++ tokenizer.

ucto (score 53, Established)
Maintenance 10/25 · Adoption 9/25 · Maturity 16/25 · Community 18/25
Stars: 70 · Forks: 14 · Commits (30d): 0 · Language: C++ · License: GPL-3.0

python-ucto (score 47, Emerging)
Maintenance 10/25 · Adoption 7/25 · Maturity 17/25 · Community 13/25
Stars: 31 · Forks: 5 · Commits (30d): 0 · Language: Cython · License: none listed
No Package · No Dependents · No License

About ucto

LanguageMachines/ucto

Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation, and splits sentences. It offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can be easily extended to suit other languages. It has been incorporated for tokenizing Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto

This tool helps researchers and developers in natural language processing (NLP) to prepare raw text for analysis. It takes text files as input and precisely separates words from punctuation, splits sentences, and can even change text case. The output is clean, pre-processed text suitable for tasks like building search indexes, part-of-speech tagging, or machine translation.

Natural Language Processing · Text Preprocessing · Linguistic Analysis · Computational Linguistics · Information Extraction

About python-ucto

proycon/python-ucto

This is a Python binding to the tokenizer ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial as it appears. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).

This project helps Natural Language Processing (NLP) practitioners break down raw text into individual words or punctuation marks. It takes plain text documents or FoLiA XML as input and produces a sequence of tokens (words, numbers, punctuation) as output, which can then be used for further linguistic analysis. Anyone working on text analysis, linguistic research, or building text-based applications would find this useful.

natural-language-processing text-analysis computational-linguistics linguistic-research data-preparation

Scores updated daily from GitHub, PyPI, and npm data.