CLUECorpus2020 and CLUEPretrainedModels
About CLUECorpus2020
CLUEbenchmark/CLUECorpus2020
Large-scale Pre-training Corpus for Chinese: 100 GB of Chinese pre-training text
This project offers a large, cleaned collection of Chinese text for pre-training language models and for Chinese text generation. The corpus is built from raw Chinese web content that has been refined into a high-quality dataset, ready for use in natural language processing applications. It is aimed at data scientists, AI researchers, and developers working on Chinese language technologies.
About CLUEPretrainedModels
CLUEbenchmark/CLUEPretrainedModels
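A collection of high-quality Chinese pre-trained models: state-of-the-art large models, the fastest small models, and models specialized for similarity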
This project provides pre-trained models designed for understanding Chinese text. Given raw Chinese text as input, the models can be used to classify content, determine the relationship between sentences, or measure semantic similarity. It is aimed at developers and data scientists building applications that need to process and understand Chinese language data.