ikegami-yukino/neologdn
Japanese text normalizer for mecab-neologd
When preparing Japanese text for analysis, this tool cleans up inconsistent character widths (half-width katakana, full-width alphanumerics), stray symbols, and repeated characters. It takes raw Japanese text, such as user comments or articles, and standardizes it into a consistent form. Anyone working with Japanese text data for tasks like sentiment analysis, search, or information extraction will find this useful.
Use this if you need to standardize Japanese text to improve the accuracy of natural language processing tasks like keyword extraction or topic modeling.
Not ideal if you primarily work with languages other than Japanese, as its normalization rules are specific to Japanese text.
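To make the normalization concrete: neologdn itself is implemented in Cython, but the kind of cleanup described above can be sketched with only the Python standard library. The function below is an illustrative approximation, not neologdn's actual implementation or rule set; the name `normalize_sketch` and the specific regexes are assumptions for this example.

```python
import re
import unicodedata

def normalize_sketch(text: str) -> str:
    """Rough, stdlib-only sketch of Japanese text normalization."""
    # NFKC folds half-width katakana to full-width and
    # full-width ASCII/digits to half-width.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces/tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text).strip()
    # Collapse long runs of the same character (3+ down to 2),
    # e.g. exaggerated prolonged-sound marks in casual writing.
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    return text

print(normalize_sketch("ﾊﾝｶｸｶﾅ"))      # → ハンカクカナ
print(normalize_sketch("ＡＢＣ１２３"))  # → ABC123
print(normalize_sketch("わーーーい"))    # → わーーい
```

In practice you would call the library directly (`import neologdn; neologdn.normalize(text)`), which applies its own, more thorough rule set tuned for mecab-neologd.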
Stars
287
Forks
20
Language
Cython
License
Apache-2.0
Category
Last pushed
Dec 02, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/ikegami-yukino/neologdn"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
EmilStenstrom/conllu
A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested python dictionary.
OpenPecha/Botok
🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
zaemyung/sentsplit
A flexible sentence segmentation library using CRF model and regex rules
natasha/razdel
Rule-based token, sentence segmentation for Russian language