JackHCC/Chinese-Tokenization
Chinese word segmentation implemented with traditional methods (N-gram, HMM, etc.), neural network methods (CNN, LSTM, etc.), and pretrained models (BERT, etc.).
This project helps natural language processing (NLP) practitioners accurately break down continuous Chinese text into individual words, a critical first step for many text analysis tasks. It takes raw Chinese sentences or documents as input and outputs segmented text, ready for further processing like sentiment analysis or information extraction. NLP developers and researchers working with Chinese text data would find this useful for building and evaluating segmentation models.
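To make the input/output contract concrete, here is a minimal forward-maximum-matching segmenter. This is an illustrative sketch only: the toy dictionary and the `segment` function are hypothetical, not code from this repository, which implements the N-gram, HMM, neural, and BERT approaches listed above.

```python
# Toy dictionary; a real segmenter loads a large lexicon or a trained model.
TOY_DICT = {"我们", "在", "学习", "中文", "分词", "中文分词"}
MAX_WORD_LEN = 4  # longest dictionary entry to try at each position

def segment(text: str) -> list[str]:
    """Greedy longest-match segmentation against TOY_DICT.
    Unknown characters fall back to single-character words."""
    result = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in TOY_DICT:
                result.append(candidate)
                i += length
                break
    return result

print(segment("我们在学习中文分词"))  # ['我们', '在', '学习', '中文分词']
```

Note how the greedy longest match keeps 中文分词 as one word rather than splitting it into 中文 / 分词; statistical and neural models exist precisely to resolve such ambiguities from context instead of by dictionary order.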
No commits in the last 6 months.
Use this if you need to implement or compare various Chinese word segmentation algorithms, from traditional to advanced deep learning methods, for your NLP applications or research.
Not ideal if you're looking for a simple, pre-built API or tool for immediate Chinese text segmentation without needing to understand or experiment with the underlying models.
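To illustrate the traditional HMM route the repo covers, here is a toy Viterbi decoder over BMES character tags (Begin/Middle/End/Single). All transition and start probabilities below are hand-picked for illustration and the emission model is passed in as a function; none of this is taken from the repo's trained parameters.

```python
import math

STATES = ["B", "M", "E", "S"]
NEG_INF = float("-inf")

# Hypothetical parameters; a real HMM segmenter estimates these from a
# tagged corpus. Impossible transitions (e.g. B -> S) get -inf.
start = {"B": math.log(0.6), "M": NEG_INF, "E": NEG_INF, "S": math.log(0.4)}
trans = {
    "B": {"B": NEG_INF, "M": math.log(0.3), "E": math.log(0.7), "S": NEG_INF},
    "M": {"B": NEG_INF, "M": math.log(0.4), "E": math.log(0.6), "S": NEG_INF},
    "E": {"B": math.log(0.5), "M": NEG_INF, "E": NEG_INF, "S": math.log(0.5)},
    "S": {"B": math.log(0.5), "M": NEG_INF, "E": NEG_INF, "S": math.log(0.5)},
}

def viterbi(chars, emit):
    """Most likely BMES tag string; emit(state, char) -> log-probability."""
    V = [{s: start[s] + emit(s, chars[0]) for s in STATES}]
    back = []
    for c in chars[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: V[-1][p] + trans[p][s])
            row[s] = V[-1][best_prev] + trans[best_prev][s] + emit(s, c)
            ptr[s] = best_prev
        V.append(row)
        back.append(ptr)
    # A valid tag sequence must end in E or S.
    last = max(("E", "S"), key=lambda s: V[-1][s])
    tags = [last]
    for ptr in reversed(back):
        tags.append(ptr[tags[-1]])
    return "".join(reversed(tags))

def tags_to_words(chars, tags):
    """Cut the character sequence after every E or S tag."""
    words, start_i = [], 0
    for i, t in enumerate(tags):
        if t in "ES":
            words.append("".join(chars[start_i:i + 1]))
            start_i = i + 1
    return words

# Uniform (untrained) emissions: the decode is driven by transitions alone.
tags = viterbi(list("中文分词"), lambda s, c: 0.0)
print(tags, tags_to_words(list("中文分词"), tags))  # BEBE ['中文', '分词']
```

The neural and BERT variants in the repo replace the hand-set emission/transition tables with learned per-character tag scores, but the same BMES decoding idea applies.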
Stars: 38
Forks: 4
Language: Python
License: —
Category:
Last pushed: Jun 15, 2022
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/JackHCC/Chinese-Tokenization"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
PyThaiNLP/pythainlp
Thai natural language processing in Python
hankcs/HanLP
Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named...
jacksonllee/pycantonese
Cantonese Linguistics and NLP
dongrixinyu/JioNLP
A Chinese NLP preprocessing & parsing package: accurate, efficient, and easy to use. www.jionlp.com
hankcs/pyhanlp
Chinese word segmentation