OpenPecha/Botok
🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
This tool helps researchers, linguists, and anyone working with Tibetan texts automatically segment raw Tibetan text into individual words. Given a block of Tibetan text or a document, it outputs the text with word boundaries clearly marked, optionally with grammatical information such as part-of-speech tags and the lemma (root form) of each word. It is designed for anyone who needs to analyze, process, or prepare Tibetan text for further study.
Used by 1 other package. Available on PyPI.
Use this if you need to precisely segment Tibetan text into words, understand their grammatical roles, or process large volumes of text for linguistic analysis or digital archiving.
Not ideal if you're only looking for simple space-based separation or don't need any detailed linguistic analysis of Tibetan text.
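To see why simple character-based splitting falls short: Tibetan script marks syllable boundaries with the tsheg (་, U+0F0B), not word boundaries, so naive splitting yields syllables rather than words. A minimal sketch using only the standard library (the greeting below is illustrative; a word tokenizer like Botok would group the syllables into the two actual words):

```python
# Tibetan uses the tsheg mark (U+0F0B) between syllables, not spaces
# between words, so naive splitting produces syllables, not words.
text = "བཀྲ་ཤིས་བདེ་ལེགས།"  # "Tashi Delek", a common greeting

# Strip the trailing shad (།, sentence mark) and split on the tsheg:
syllables = [s for s in text.rstrip("\u0f0d").split("\u0f0b") if s]
print(syllables)  # → ['བཀྲ', 'ཤིས', 'བདེ', 'ལེགས']  (4 syllables)

# The greeting is actually two words — བཀྲ་ཤིས ("tashi") and
# བདེ་ལེགས ("delek") — which is the grouping a word tokenizer recovers.
```

This is why dictionary- and rule-based word segmentation is needed for Tibetan, rather than splitting on whitespace or punctuation alone.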
Stars: 78
Forks: 16
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 16, 2026
Commits (30d): 0
Dependencies: 2
Reverse dependents: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/OpenPecha/Botok"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related tools
EmilStenstrom/conllu
A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested Python dictionary.
taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
zaemyung/sentsplit
A flexible sentence segmentation library using CRF model and regex rules
natasha/razdel
Rule-based token, sentence segmentation for Russian language
polm/cutlet
Japanese to romaji converter in Python