ikegami-yukino/neologdn
Japanese text normalizer for mecab-neologd
When preparing Japanese text for analysis, this tool cleans up inconsistent character widths (half-width katakana, full-width alphanumerics), stray symbols, and repeated characters. It takes raw Japanese text, such as user comments or articles, and standardizes it into a consistent form. Anyone working with Japanese text data for tasks like sentiment analysis, search, or information extraction will find this useful.
Use this if you need to standardize Japanese text to improve the accuracy of natural language processing tasks like keyword extraction or topic modeling.
Not ideal if you primarily work with languages other than Japanese, as its normalization rules are specific to Japanese text.
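To make the normalization concrete: neologdn itself is implemented in Cython, but the kind of cleanup described above can be sketched with only the Python standard library. The function below is an illustrative approximation, not neologdn's actual implementation or rule set; the name `normalize_sketch` and the specific regexes are assumptions for this example.

```python
import re
import unicodedata

def normalize_sketch(text: str) -> str:
    """Rough, stdlib-only sketch of Japanese text normalization."""
    # NFKC folds half-width katakana to full-width and
    # full-width ASCII/digits to half-width.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces/tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text).strip()
    # Collapse long runs of the same character (3+ down to 2),
    # e.g. exaggerated prolonged-sound marks in casual writing.
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    return text

print(normalize_sketch("ﾊﾝｶｸｶﾅ"))      # → ハンカクカナ
print(normalize_sketch("ＡＢＣ１２３"))  # → ABC123
print(normalize_sketch("わーーーい"))    # → わーーい
```

In practice you would call the library directly (`import neologdn; neologdn.normalize(text)`), which applies its own, more thorough rule set tuned for mecab-neologd.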
Stars
287
Forks
20
Language
Cython
License
Apache-2.0
Category
Last pushed
Dec 02, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/ikegami-yukino/neologdn"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
EmilStenstrom/conllu
A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested python dictionary.
OpenPecha/Botok
🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
zaemyung/sentsplit
A flexible sentence segmentation library using CRF model and regex rules
natasha/razdel
Rule-based token, sentence segmentation for Russian language