wittawatj/jtcc
Java library to tokenize Thai text into a list of TCCs
This tool prepares Thai text for natural language processing by breaking it into Thai Character Clusters (TCCs). It takes raw Thai text from the command line or from a file and outputs a sequence of TCCs: groups of characters, such as a consonant together with its attached vowel and tone marks, that cannot be split further. It is aimed mainly at developers building larger Thai NLP systems.
No commits in the last 6 months.
Use this if you are developing a Thai natural language processing application and need a foundational step to segment Thai text into character clusters.
Not ideal if you need a full word segmenter, syllable tokenizer, or a tool that considers grammatical context for text analysis.
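To make the idea of a character cluster concrete, here is a minimal sketch of cluster splitting in Java. It is not jtcc's code or API: the class name SimpleTcc, its method names, and the three character ranges are assumptions made for illustration, and the rules are a deliberate simplification of real TCC rules. A vowel written before its consonant starts a cluster and takes the next base letter; following, above, and below vowels and tone marks join the preceding cluster.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; jtcc's real rule set is richer than this.
public class SimpleTcc {

    // Base letters in the range U+0E01..U+0E2E, treated as consonants here.
    static boolean isConsonant(char c) {
        return c >= '\u0E01' && c <= '\u0E2E';
    }

    // Vowels written before the letter they belong to, U+0E40..U+0E44.
    static boolean isLeadingVowel(char c) {
        return c >= '\u0E40' && c <= '\u0E44';
    }

    // Characters this sketch glues onto the preceding cluster:
    // vowels in U+0E30..U+0E3A and the marks in U+0E47..U+0E4E.
    static boolean attachesToPrevious(char c) {
        return (c >= '\u0E30' && c <= '\u0E3A') || (c >= '\u0E47' && c <= '\u0E4E');
    }

    // Splits text into simplified clusters, scanning left to right.
    public static List<String> split(String text) {
        List<String> clusters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean awaitingBase = false; // true after a leading vowel with no base yet
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (current.length() > 0 && attachesToPrevious(c)) {
                current.append(c);
            } else if (awaitingBase && isConsonant(c)) {
                current.append(c);
                awaitingBase = false;
            } else {
                // Anything else closes the current cluster and opens a new one.
                if (current.length() > 0) {
                    clusters.add(current.toString());
                    current.setLength(0);
                }
                current.append(c);
                awaitingBase = isLeadingVowel(c);
            }
        }
        if (current.length() > 0) {
            clusters.add(current.toString());
        }
        return clusters;
    }

    public static void main(String[] args) {
        // The sample is the word for "Thailand", written with Unicode escapes.
        String sample = "\u0E1B\u0E23\u0E30\u0E40\u0E17\u0E28\u0E44\u0E17\u0E22";
        System.out.println(String.join("|", split(sample)));
    }
}
```

Under these simplified rules, ประเทศไทย splits into six clusters: ป, ระ, เท, ศ, ไท, ย. Clusters like these are smaller than syllables or words, which is why the output suits a first preprocessing step rather than word segmentation.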
Stars: 19
Forks: 5
Language: Java
License: GPL-3.0
Category:
Last pushed: May 30, 2017
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/wittawatj/jtcc"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
PyThaiNLP/pythainlp
Thai natural language processing in Python
hankcs/HanLP
Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named...
jacksonllee/pycantonese
Cantonese Linguistics and NLP
dongrixinyu/JioNLP
A Chinese NLP preprocessing and parsing package: accurate, efficient, and easy to use. www.jionlp.com
hankcs/pyhanlp
Chinese word segmentation