clipperhouse/uax29

A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split graphemes, words, sentences.

44 / 100

Emerging

This tool helps developers break text into graphemes, words, and sentences according to the Unicode text segmentation standard (UAX #29). It takes raw text and outputs a stream of these segments, which serve as the input to many natural language processing tasks. Developers building multilingual search engines, text analysis tools, or language understanding models would find this valuable.
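To illustrate why grapheme segmentation matters, here is a minimal sketch that uses only the Go standard library. The helper `roughGraphemes` is hypothetical and is not part of uax29: it only attaches nonspacing combining marks (Unicode category Mn) to the preceding rune, which is a small subset of the UAX #29 grapheme rules. It ignores emoji sequences, Hangul syllables, and the other cases a conformant segmenter handles.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// roughGraphemes is a hypothetical helper, NOT the uax29 API.
// It groups each rune with any nonspacing marks (category Mn) that follow it.
// This is a deliberate simplification of the UAX #29 grapheme cluster rules.
func roughGraphemes(s string) []string {
	var out []string
	start := 0
	for i, r := range s {
		// Start a new cluster at every rune that is not a nonspacing mark.
		if i > start && !unicode.Is(unicode.Mn, r) {
			out = append(out, s[start:i])
			start = i
		}
	}
	if start < len(s) {
		out = append(out, s[start:])
	}
	return out
}

func main() {
	s := "cafe\u0301" // "café" written with a combining acute accent

	fmt.Println(len(s))                    // number of bytes
	fmt.Println(utf8.RuneCountInString(s)) // number of code points
	fmt.Println(len(roughGraphemes(s)))    // number of approximate grapheme clusters
}
```

The three counts differ for the same string, which is why counting bytes or runes is not a substitute for segmenting by grapheme.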

101 stars.

Use this if you need a reliable, multilingual way to segment text into words, sentences, or graphemes (user-perceived characters) for tasks like building an inverted index or performing text analysis.
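As a rough sketch of the inverted-index use case, the example below maps each word token to the IDs of the documents that contain it. The `tokenize` function is a hypothetical stand-in written with the standard library; it splits on anything that is not a letter or number and is not UAX #29 conformant. In a real application, a conformant word segmenter would take its place.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize is a hypothetical stand-in for a real word segmenter.
// It splits on any rune that is not a letter or number, then lowercases.
// It does NOT implement the UAX #29 word boundary rules.
func tokenize(text string) []string {
	words := strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
	for i, w := range words {
		words[i] = strings.ToLower(w)
	}
	return words
}

// buildIndex maps each token to the IDs (slice positions) of the documents
// that contain it. Each document ID appears at most once per token.
func buildIndex(docs []string) map[string][]int {
	index := make(map[string][]int)
	for id, doc := range docs {
		seen := make(map[string]bool)
		for _, tok := range tokenize(doc) {
			if !seen[tok] {
				seen[tok] = true
				index[tok] = append(index[tok], id)
			}
		}
	}
	return index
}

func main() {
	docs := []string{"Hello, world!", "Hello again."}
	index := buildIndex(docs)
	fmt.Println(index["hello"]) // IDs of documents containing "hello"
	fmt.Println(index["world"]) // IDs of documents containing "world"
}
```

The quality of the index depends directly on the tokenizer: any boundary the tokenizer gets wrong becomes a term that queries cannot match.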

Not ideal if your application doesn't require precise, Unicode-conformant text segmentation and a simple split by spaces is sufficient.
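For comparison, this is what a simple split by spaces looks like with the standard library's `strings.Fields`. The wrapper name `splitBySpaces` is only illustrative. The sketch shows the two typical limitations: punctuation stays attached to words, and text in scripts that do not separate words with spaces comes back as a single field.

```go
package main

import (
	"fmt"
	"strings"
)

// splitBySpaces is an illustrative wrapper around strings.Fields,
// which splits on runs of Unicode white space and nothing else.
func splitBySpaces(text string) []string {
	return strings.Fields(text)
}

func main() {
	// Punctuation remains attached to the neighboring word.
	fmt.Printf("%q\n", splitBySpaces("Hello, world!"))

	// Japanese text has no spaces, so the whole string is one field.
	fmt.Printf("%q\n", splitBySpaces("こんにちは世界"))
}
```

If output like this is acceptable for your input, a dedicated segmentation library is probably more than you need.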

text-segmentation natural-language-processing full-text-search text-analysis multilingual-text

No Package No Dependents

Maintenance 10 / 25

Adoption 9 / 25

Maturity 16 / 25

Community 9 / 25


Stars 101

Forks

Language Go

License MIT

Higher-rated alternatives

ikawaha/kagome-dict

Dictionary Library for Kagome v2

aaaton/golem

A lemmatizer implemented in Go

habeanf/yap

Yet Another (natural language) Parser

abadojack/whatlanggo

Natural language detection library for Go

jdkato/prose

A Golang library for text processing, including tokenization, part-of-speech tagging, and...
