himkt/konoha
🌿 An easy-to-use Japanese text processing tool that makes it possible to switch tokenizers with minimal code changes.
When you need to break Japanese text into individual words or sentences for analysis, this tool helps you do it consistently: you provide raw Japanese text, and it returns the text segmented into meaningful units such as words or sentences. It is aimed at data scientists, linguists, and anyone who needs to prepare Japanese text for further computational processing.
Use this if you need to reliably split Japanese text into words or sentences and want the flexibility to easily switch between different text segmentation methods.
Not ideal if you are looking for advanced natural language understanding features beyond basic text segmentation, such as sentiment analysis or named entity recognition.
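Konoha's main selling point is that the tokenizer backend is selected by name, so swapping segmentation methods is a one-line change. A minimal sketch of that pattern, based on the `WordTokenizer`/`SentenceTokenizer` API shown in konoha's README (which backends actually work depends on the optional dependencies you install, e.g. MeCab):

```python
# Sketch of konoha's switch-by-name tokenizer API, per its README.
# Requires `pip install konoha`; backends like MeCab need extras,
# e.g. `pip install 'konoha[mecab]'`.
from konoha import SentenceTokenizer, WordTokenizer

text = "自然言語処理を勉強しています。とても楽しいです。"

# Sentence segmentation (rule-based, no extra backend needed).
sentences = SentenceTokenizer().tokenize(text)

# Word segmentation: switching methods is just a different name string.
for name in ("MeCab", "Character", "Whitespace"):
    tokenizer = WordTokenizer(name)  # raises if the backend isn't installed
    print(name, tokenizer.tokenize(sentences[0]))
```

This sketch prints each sentence's tokens once per backend; in practice you would pick one backend and keep the rest of your pipeline unchanged when you swap it.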
Stars: 261
Forks: 26
Language: Python
License: MIT
Category:
Last pushed: Mar 01, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/himkt/konoha"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related tools
EmilStenstrom/conllu
A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested python dictionary.
OpenPecha/Botok
🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
zaemyung/sentsplit
A flexible sentence segmentation library using a CRF model and regex rules
taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
natasha/razdel
Rule-based token and sentence segmentation for the Russian language