esbatmop/MNBVC

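MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, and even data written in "Martian script" (火星文). It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.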

/ 100

Established

This project offers a massive, continuously updated collection of diverse Chinese text data, covering everything from news and novels to niche online content and ancient poetry. It provides raw, unprocessed text files along with support for various formats. Researchers and developers working on large-scale language models or natural language processing applications for Chinese will find this data highly valuable.

4,144 stars. Actively maintained with 2 commits in the last 30 days.

Use this if you need an extremely large and varied dataset of Chinese text to train or evaluate AI language models, especially those designed for general understanding or creative text generation.

Not ideal if you require highly structured, domain-specific, or meticulously clean and indexed data for tasks where legal compliance or strict content categorization is paramount.

AI-training NLP-development Chinese-language-models data-science computational-linguistics

No Package No Dependents

Maintenance 13 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 18 / 25


Stars

4,144

Forks

287

Language

—

License

MIT

Related tools

NateScarlet/holiday-cn

📅🇨🇳 Chinese statutory holiday data, automatically scraped daily from State Council announcements

sagorbrur/bnlp

BNLP is a natural language processing toolkit for Bengali Language.

brightmart/nlp_chinese_corpus

Large Scale Chinese Corpus for NLP

houbb/sensitive-word

👮‍♂️ The sensitive word tool for Java. (Sensitive words / banned words / illegal words / profanity. A high-performance, DFA-based Java...

thunlp/THUOCL

THUOCL (THU Open Chinese Lexicon), a Chinese lexicon
