Kensuke-Mitsuzawa/JapaneseTokenizers
Aims to make JapaneseTokenizer as easy to use as possible
This project helps Japanese language specialists, such as linguists, data analysts, and researchers, process raw Japanese text. It takes sentences or documents as input and outputs a list of individual words or tokens along with their grammatical information (such as part of speech). This gives you a standardized way to break down Japanese sentences for further analysis.
137 stars. No commits in the last 6 months.
Use this if you need to reliably break down Japanese sentences into individual words for linguistic analysis, text mining, or other data processing tasks.
Not ideal if your work involves languages other than Japanese, or if you only need a basic word count without detailed grammatical information.
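The snippet below is a minimal usage sketch of the wrapper described above. The class and method names (MecabWrapper, tokenize, convert_list_object) and the dictType argument are recalled from the project's README and should be treated as assumptions; MeCab and its Python bindings must be installed separately, and the exact API may differ in the installed version.

import JapaneseTokenizer

# Wrap MeCab with the standard ipadic dictionary (the accepted dictType
# values are an assumption; check the repository's README).
tokenizer = JapaneseTokenizer.MecabWrapper(dictType='ipadic')

sentence = '隣の客はよく柿食う客だ'  # "The guest next door often eats persimmons."

# tokenize() returns an object holding tokens and their part-of-speech tags;
# convert_list_object() flattens it into a plain list of surface forms.
tokens = tokenizer.tokenize(sentence).convert_list_object()
print(tokens)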
Stars: 137
Forks: 21
Language: Python
License: MIT
Category: nlp
Last pushed: Mar 25, 2019
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Kensuke-Mitsuzawa/JapaneseTokenizers"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
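Below is a quick sketch of calling the same endpoint from Python instead of curl. Only the URL comes from this page; that the endpoint returns JSON, and whatever fields it contains, are assumptions, so inspect the payload before building on it.

import json
import urllib.request

url = ("https://pt-edge.onrender.com/api/v1/quality/nlp/"
       "Kensuke-Mitsuzawa/JapaneseTokenizers")

# Fetch the quality data; no API key is needed under 100 requests/day.
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# The response schema is not documented on this page, so just dump it.
print(json.dumps(data, indent=2, ensure_ascii=False))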
Higher-rated alternatives
EmilStenstrom/conllu
A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested Python dictionary.
OpenPecha/Botok
🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
zaemyung/sentsplit
A flexible sentence segmentation library using a CRF model and regex rules
taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
natasha/razdel
Rule-based token, sentence segmentation for Russian language