OpenPecha/Botok
🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
This tool helps researchers, linguists, and anyone working with Tibetan texts automatically segment raw Tibetan text into individual words. Given a block of Tibetan text or a document, it outputs the text with word boundaries clearly marked, optionally with grammatical information such as part-of-speech tags and the lemma (root form) of each word. It is designed for anyone who needs to analyze, process, or prepare Tibetan text for further study.
Used by 1 other package. Available on PyPI.
Use this if you need to precisely segment Tibetan text into words, understand their grammatical roles, or process large volumes of text for linguistic analysis or digital archiving.
Not ideal if you're only looking for simple space-based separation or don't need any detailed linguistic analysis of Tibetan text.
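To see why simple character-based splitting falls short: Tibetan script marks syllable boundaries with the tsheg (་, U+0F0B), not word boundaries, so naive splitting yields syllables rather than words. A minimal sketch using only the standard library (the greeting below is illustrative; a word tokenizer like Botok would group the syllables into the two actual words):

```python
# Tibetan uses the tsheg mark (U+0F0B) between syllables, not spaces
# between words, so naive splitting produces syllables, not words.
text = "བཀྲ་ཤིས་བདེ་ལེགས།"  # "Tashi Delek", a common greeting

# Strip the trailing shad (།, sentence mark) and split on the tsheg:
syllables = [s for s in text.rstrip("\u0f0d").split("\u0f0b") if s]
print(syllables)  # → ['བཀྲ', 'ཤིས', 'བདེ', 'ལེགས']  (4 syllables)

# The greeting is actually two words — བཀྲ་ཤིས ("tashi") and
# བདེ་ལེགས ("delek") — which is the grouping a word tokenizer recovers.
```

This is why dictionary- and rule-based word segmentation is needed for Tibetan, rather than splitting on whitespace or punctuation alone.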
Stars: 78
Forks: 16
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 16, 2026
Commits (30d): 0
Dependencies: 2
Reverse dependents: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/OpenPecha/Botok"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related tools
EmilStenstrom/conllu
A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested Python dictionary.
taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
zaemyung/sentsplit
A flexible sentence segmentation library using CRF model and regex rules
natasha/razdel
Rule-based token, sentence segmentation for Russian language
polm/cutlet
Japanese to romaji converter in Python