liuzl/tokenizer
Natural Language Tokenizer
A Go library for breaking raw text into individual words or meaningful units. It accepts text in various languages and outputs a clean list of its constituent words, correctly handling special cases such as contractions and possessives. It's designed for developers building applications that process or analyze human language.
No commits in the last 6 months.
Use this if you are a developer building a search engine, text analyzer, or any application that needs to accurately segment multilingual text into individual words.
Not ideal if you need advanced natural language processing features beyond basic word segmentation, such as sentiment analysis or part-of-speech tagging.
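To illustrate what "correctly handling contractions and possessives" means in practice, here is a minimal conceptual sketch of word segmentation in Go. It is not liuzl/tokenizer's actual API (the `tokenize` function and its rules are assumptions for illustration); it only shows the behavior a word tokenizer is expected to have, keeping apostrophes inside words so "isn't" and "John's" survive as single tokens.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize splits text into word tokens, treating any run of letters,
// digits, and in-word apostrophes as one token. This is a conceptual
// sketch, NOT the liuzl/tokenizer API.
func tokenize(text string) []string {
	isWordRune := func(r rune) bool {
		return unicode.IsLetter(r) || unicode.IsDigit(r) || r == '\''
	}
	raw := strings.FieldsFunc(text, func(r rune) bool { return !isWordRune(r) })
	tokens := make([]string, 0, len(raw))
	for _, t := range raw {
		// Strip apostrophes used as quotation marks at token edges,
		// while leaving internal ones (contractions, possessives) intact.
		tokens = append(tokens, strings.Trim(t, "'"))
	}
	return tokens
}

func main() {
	fmt.Println(tokenize("It's John's book, isn't it?"))
	// Contractions and possessives stay whole; punctuation is dropped.
}
```

A real tokenizer layers language-specific rules (and, for languages without spaces, dictionary or statistical segmentation) on top of this kind of rune classification.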
Stars: 10
Forks: —
Language: Go
License: Apache-2.0
Category:
Last pushed: Nov 28, 2018
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/liuzl/tokenizer"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
ikawaha/kagome-dict
Dictionary Library for Kagome v2
aaaton/golem
A lemmatizer implemented in Go
habeanf/yap
Yet Another (natural language) Parser
clipperhouse/uax29
A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split graphemes, words, sentences.
abadojack/whatlanggo
Natural language detection library for Go