clipperhouse/uax29
A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split graphemes, words, sentences.
This library splits text into graphemes, words, and sentences according to the Unicode text segmentation rules (UAX #29). It takes raw text and returns a stream of segments, a common first step in natural language processing. It is aimed at developers building multilingual search engines, text analysis tools, or language understanding models.
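To see why grapheme segmentation differs from counting bytes or code points, here is a minimal sketch that uses only the Go standard library. It does not call uax29 itself, and the helper name `counts` is made up for illustration. It shows that a single user-perceived character can span several code points, which UAX #29 groups into one grapheme cluster.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper for this sketch. It reports the
// UTF-8 byte length and the number of code points (runes) in s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// "e" followed by U+0301 COMBINING ACUTE ACCENT renders as one "é".
	s := "e\u0301"

	nBytes, nRunes := counts(s)
	fmt.Println("bytes:", nBytes) // bytes: 3
	fmt.Println("runes:", nRunes) // runes: 2

	// Under the UAX #29 rules, a base letter plus a combining mark
	// forms a single extended grapheme cluster, so a conformant
	// grapheme segmenter would yield one segment for this string.
}
```

Neither count matches what a reader sees as one character, which is the gap a grapheme segmenter fills.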
101 stars.
Use this if you need a reliable, multilingual way to segment text into words, sentences, or graphemes (user-perceived characters) for tasks like building an inverted index or performing text analysis.
Not ideal if your application doesn't require precise, Unicode-conformant text segmentation and a simple split by spaces is sufficient.
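For comparison, the "simple split by spaces" baseline can be sketched with the Go standard library alone. This is not uax29 code, and `splitOnSpaces` is a hypothetical helper name. It illustrates what whitespace splitting leaves unhandled: punctuation stays attached to words, and text without spaces between words is not divided at all.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOnSpaces is a hypothetical helper for this sketch: the naive
// baseline that splits on runs of Unicode whitespace and nothing else.
func splitOnSpaces(text string) []string {
	return strings.Fields(text)
}

func main() {
	for _, tok := range splitOnSpaces("Hello, 世界. Nice dog!") {
		fmt.Printf("%q\n", tok)
	}
	// Prints:
	// "Hello,"
	// "世界."
	// "Nice"
	// "dog!"
}
```

If tokens like `"Hello,"` and `"dog!"` are acceptable for your use case, whitespace splitting may be enough. If punctuation needs to be separated from words, or the text is in a script that does not put spaces between words, a UAX #29 segmenter is the better fit.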
Stars: 101
Forks: 6
Language: Go
License: MIT
Category:
Last pushed: Feb 16, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/clipperhouse/uax29"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
ikawaha/kagome-dict
Dictionary Library for Kagome v2
aaaton/golem
A lemmatizer implemented in Go
habeanf/yap
Yet Another (natural language) Parser
abadojack/whatlanggo
Natural language detection library for Go
jdkato/prose
A Golang library for text processing, including tokenization, part-of-speech tagging, and...