thunlp/THUOCL
THUOCL (THU Open Chinese Lexicon): a Chinese lexicon
This project provides a collection of high-quality Chinese vocabulary lists covering domains such as IT, finance, medicine, and law. Each list pairs common words with their document frequency (DF), the number of documents in a large text collection that contain the word. The curated lists are intended to help natural language processing practitioners improve the accuracy of Chinese word segmentation.
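As a rough illustration of working with such a list, the sketch below parses lines that hold a word and its DF value. The line layout (word, whitespace, integer DF) and the sample entries are assumptions made for this example, not taken from the listing; check the actual files in the repository before relying on it.

```python
from typing import Dict, Iterable


def parse_lexicon(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'word<whitespace>DF' lines into a {word: DF} mapping.

    The line layout is an assumption for this sketch.
    """
    lexicon: Dict[str, int] = {}
    for raw in lines:
        parts = raw.split()
        # Skip blank or malformed lines rather than failing on them.
        if len(parts) != 2 or not parts[1].isdecimal():
            continue
        word, df = parts
        lexicon[word] = int(df)
    return lexicon


# Hypothetical sample data; the DF values here are made up.
sample = ["人工智能\t1200", "数据库\t950", "", "bad line here"]
lexicon = parse_lexicon(sample)

# Rank words by document frequency, highest first.
top = sorted(lexicon.items(), key=lambda kv: kv[1], reverse=True)
```

To read a real list, pass an open file handle (for example `open(path, encoding="utf-8")`) instead of the sample list.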
1,034 stars. No commits in the last 6 months.
Use this if you are working with Chinese text and need specialized vocabulary lists to achieve more precise word segmentation in your applications or research.
Not ideal if you need a dictionary for general lookup or translation, as these lists are specifically formatted and curated for computational linguistics tasks.
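To make the segmentation use case concrete, here is a minimal sketch of forward maximum matching, a simple dictionary-based segmentation technique chosen for illustration. It is not the method of any particular segmenter, and the function name, the `max_len` parameter, and the tiny vocabulary are hypothetical.

```python
from typing import List, Set


def forward_max_match(text: str, vocab: Set[str], max_len: int = 6) -> List[str]:
    """Greedy left-to-right segmentation using the longest lexicon match."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in vocab:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical domain vocabulary, standing in for a loaded word list.
vocab = {"机器学习", "机器", "学习", "算法"}
tokens = forward_max_match("机器学习算法", vocab)
```

With a domain list loaded into `vocab`, multi-character terms stay intact instead of being split into shorter words or single characters. Production segmenters typically also weigh word frequencies, which this sketch ignores.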
Stars: 1,034
Forks: 206
Language: —
License: MIT
Category: (not listed)
Last pushed: Apr 03, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/thunlp/THUOCL"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
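The same request can be sketched in Python with the standard library. The path layout (category, then owner, then repository) is inferred from the single URL above, and the helper names are hypothetical. The response format is not documented here, so the sketch returns the raw body as text.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL; the segment order is an assumption."""
    parts = [quote(p, safe="") for p in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the raw response body as text (format not assumed)."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Building the URL needs no network access.
url = quality_url("nlp", "thunlp", "THUOCL")
```

Calling `fetch_quality("nlp", "thunlp", "THUOCL")` performs the actual request and counts against the daily quota.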
Related tools
NateScarlet/holiday-cn
📅🇨🇳 Chinese statutory public holiday data, scraped automatically every day from State Council announcements
sagorbrur/bnlp
BNLP is a natural language processing toolkit for Bengali Language.
brightmart/nlp_chinese_corpus
Large Scale Chinese Corpus for NLP
houbb/sensitive-word
👮‍♂️ The sensitive word tool for Java. (Sensitive words / banned words / illegal words / profanity. Built on the DFA algorithm, a high-performance Java...
esbatmop/MNBVC
MNBVC(Massive Never-ending BT Vast Chinese...