GINK03/boosting-tree-tokenizer

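Uses a Gradient Boosting Decision Tree (LightGBM) to learn and predict, in a supervised fashion, word segmentation (wakachi) and morpheme tags for natural language. The author would like to name it 珊瑚 (sango).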

/ 100

Experimental

This project helps Japanese language practitioners accurately break down sentences into individual words and identify their parts of speech, without needing extensive, predefined dictionaries. You input raw Japanese text, and it outputs the text segmented into words with their corresponding grammatical categories. This tool is for linguists, natural language processing researchers, or content analysts working with Japanese text who need flexible and robust morphological analysis.

No commits in the last 6 months.

Use this if you need to perform supervised Japanese morphological analysis and part-of-speech tagging, especially if you want to train a custom model without relying on fixed dictionary files.

Not ideal if you require a pre-built, off-the-shelf solution for general Japanese text analysis that doesn't involve custom training or model generation.
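One common way to frame supervised segmentation for a tree-based classifier is as a per-character boundary decision: each character gets a label saying whether a word ends there, plus features drawn from the surrounding characters. The sketch below illustrates that framing only. It is not confirmed to be this repository's method, and the function names, the padding character, and the window size are all assumptions made for illustration. It uses only the Python standard library and stops short of training; the resulting feature rows and labels could be passed to a gradient boosting classifier such as LightGBM.

```python
from typing import List, Tuple

# Hypothetical padding character (ideographic space) for window edges.
PAD = "\u3000"


def boundary_labels(words: List[str]) -> Tuple[str, List[int]]:
    """Join non-empty words into raw text; labels[i] is 1 if a word ends at character i."""
    text = "".join(words)
    labels: List[int] = []
    for word in words:
        labels.extend([0] * (len(word) - 1) + [1])
    return text, labels


def window_features(text: str, index: int, radius: int = 2) -> List[int]:
    """Code points of the characters in a window centred on text[index]."""
    padded = PAD * radius + text + PAD * radius
    window = padded[index : index + 2 * radius + 1]
    return [ord(ch) for ch in window]


def labels_to_words(text: str, labels: List[int]) -> List[str]:
    """Invert boundary_labels: cut the text after every character labelled 1."""
    words: List[str] = []
    start = 0
    for i, label in enumerate(labels):
        if label == 1:
            words.append(text[start : i + 1])
            start = i + 1
    if start < len(text):
        # Trailing characters without a closing boundary form a final word.
        words.append(text[start:])
    return words


if __name__ == "__main__":
    text, labels = boundary_labels(["私", "は", "珊瑚", "です"])
    rows = [window_features(text, i) for i in range(len(text))]
    print(text, labels, len(rows))
    print(labels_to_words(text, labels))
```

A part-of-speech tagger can be framed the same way by replacing the binary boundary label with a tag per character or per predicted word, which is again an assumption about the general approach rather than a description of this project.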

Japanese-language-analysis morphological-analysis part-of-speech-tagging text-segmentation linguistics

Stale 6m No Package No Dependents

Maintenance 0 / 25

Adoption 5 / 25

Maturity 16 / 25

Community 0 / 25

Stars

Forks

—

Language: C++

License: MIT

Higher-rated alternatives

EmilStenstrom/conllu

A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested python dictionary.

OpenPecha/Botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python

zaemyung/sentsplit

A flexible sentence segmentation library using CRF model and regex rules

taishi-i/nagisa

A Japanese tokenizer based on recurrent neural networks

natasha/razdel

Rule-based token, sentence segmentation for Russian language
