CanCLID/canto-filter
粵文語料篩選器 Cantonese text filter
This tool helps researchers, linguists, and educators sort large volumes of Chinese text. You provide a list of sentences, and it classifies each one as pure Cantonese, pure Mandarin, mixed, or neutral, so you can identify and extract high-quality Cantonese content when building or analyzing Cantonese text collections.
Use this if you need to filter large datasets quickly and extract unambiguous, high-quality Cantonese text, and you care more about accurate Cantonese identification than about classifying every single sentence.
Not ideal if you need a fine-grained or exhaustive classification of every input sentence, including ambiguous or borderline cases, and can tolerate slower processing.
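The four-way labelling described above can be illustrated with a toy sketch. This is not canto-filter's implementation or API: the `Label` enum, the `classify` and `keep_cantonese` functions, and the two tiny marker sets are all hypothetical, chosen only to show how "pure Cantonese", "pure Mandarin", "mixed", and "neutral" can fall out of checking for characters distinctive to each variety.

```python
from enum import Enum


class Label(Enum):
    CANTONESE = "cantonese"
    MANDARIN = "mandarin"
    MIXED = "mixed"
    NEUTRAL = "neutral"


# Hypothetical, deliberately tiny marker sets (assumption, not the tool's rules).
# A real filter needs far larger lists plus exceptions, e.g. "的" in "的士".
CANTONESE_MARKERS = ("嘅", "咗", "喺", "冇", "佢", "哋", "咁", "唔", "嘢", "啲")
MANDARIN_MARKERS = ("的", "了", "在", "他", "她", "們", "们", "這", "这", "哪", "是")


def classify(sentence: str) -> Label:
    """Label one sentence by which marker sets it touches."""
    has_cantonese = any(m in sentence for m in CANTONESE_MARKERS)
    has_mandarin = any(m in sentence for m in MANDARIN_MARKERS)
    if has_cantonese and has_mandarin:
        return Label.MIXED
    if has_cantonese:
        return Label.CANTONESE
    if has_mandarin:
        return Label.MANDARIN
    # No distinctive markers either way: could be read as either variety.
    return Label.NEUTRAL


def keep_cantonese(sentences: list[str]) -> list[str]:
    """Keep only sentences labelled pure Cantonese, dropping everything else."""
    return [s for s in sentences if classify(s) is Label.CANTONESE]
```

Keeping only the `CANTONESE` label and discarding mixed and neutral sentences mirrors the trade-off stated above: some usable sentences are lost, but what remains is unambiguous.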
Stars: 41
Forks: 4
Language: Python
License: MIT
Category:
Last pushed: Feb 04, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/CanCLID/canto-filter"
Open to everyone: 100 requests/day with no key needed, or 1,000 requests/day with a free key.
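The same request can be made from Python with only the standard library. This is a minimal sketch under two assumptions not confirmed by the page: that the path segments after `quality/` are category, owner, and repository (inferred from the single URL shown), and that the endpoint returns JSON. The helper names `build_quality_url` and `fetch_quality` are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL; the segment order is inferred from the curl example."""
    parts = (quote(p, safe="") for p in (category, owner, repo))
    return f"{BASE_URL}/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch and parse the response, assuming it is JSON (unverified)."""
    url = build_quality_url(category, owner, repo)
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (performs a network request, counts against the daily quota):
# data = fetch_quality("nlp", "CanCLID", "canto-filter")
```

The response schema is not documented here, so the sketch returns the parsed payload as-is rather than picking out fields.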
Higher-rated alternatives
NateScarlet/holiday-cn
📅🇨🇳 Chinese statutory holiday data, scraped automatically every day from State Council announcements
sagorbrur/bnlp
BNLP is a natural language processing toolkit for Bengali Language.
brightmart/nlp_chinese_corpus
Large Scale Chinese Corpus for NLP
houbb/sensitive-word
👮‍♂️ The sensitive word tool for Java. (Sensitive words / banned words / illegal words / profanity. Built on the DFA algorithm, a high-performance Java...
esbatmop/MNBVC
MNBVC(Massive Never-ending BT Vast Chinese...