tetutaro/mecab_dictionaries
Create various dictionaries for MeCab and the MeCab CLI using fugashi
When performing Japanese natural language processing, you need specialized dictionaries to accurately segment sentences into individual words. This project provides scripts that build ready-to-use Python packages of various MeCab dictionaries, including the UniDic, IPA, and JUMAN dictionaries, optionally enhanced with NEologd. It is aimed at data scientists and researchers who need precise Japanese text analysis.
No commits in the last 6 months.
Use this if you are a developer working on Japanese text analysis and need to quickly set up MeCab dictionaries within your Python environment, especially when using 'fugashi' or 'mecab-python3'.
Not ideal if you're looking for a pre-packaged application or a non-technical solution to analyze Japanese text without needing to build or manage dictionary resources yourself.
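The dictionary you choose determines the feature columns each token carries, which is why builds for UniDic, IPA, and JUMAN differ. As a rough illustration (not part of this project), a minimal parser for MeCab's default tab-separated output might look like the sketch below; the sample line and its IPAdic-style feature layout are assumptions for demonstration:

```python
# Hypothetical MeCab output line in IPAdic format:
# surface form, a tab, then comma-separated features
# (part of speech, conjugation, base form, reading, ...).
line = "走る\t動詞,自立,*,*,五段・ラ行,基本形,走る,ハシル,ハシル"

def parse_mecab_line(line: str) -> dict:
    """Split one MeCab output line into surface and feature fields."""
    surface, feature_str = line.split("\t", 1)
    features = feature_str.split(",")
    # The first feature field is the coarse part-of-speech tag
    # in both IPAdic and UniDic layouts.
    return {"surface": surface, "pos": features[0], "features": features}

token = parse_mecab_line(line)
print(token["surface"], token["pos"])  # 走る 動詞
```

Note that UniDic emits more (and differently ordered) feature columns than IPAdic, so code that indexes into `features` must know which dictionary produced the output.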
Stars: 8
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: Feb 19, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/tetutaro/mecab_dictionaries"
Open to everyone: 100 requests per day with no key needed. Get a free key for 1,000 per day.
Higher-rated alternatives
EmilStenstrom/conllu
A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested python dictionary.
OpenPecha/Botok
🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
zaemyung/sentsplit
A flexible sentence segmentation library using CRF model and regex rules
natasha/razdel
Rule-based token, sentence segmentation for Russian language