jiangnanboy/llm_corpus_quality
大模型预训练中文语料清洗及质量评估 (Chinese corpus cleaning and quality assessment for large language model pre-training)
This project helps AI researchers and data scientists prepare high-quality Chinese text data for training large language models. It takes raw, potentially messy Chinese text corpora as input and outputs a cleaned, de-duplicated, and quality-assessed dataset, free from sensitive or advertising content. This ensures the foundational data used for powerful AI models is reliable and safe.
No commits in the last 6 months.
Use this if you are an AI researcher or data scientist working with large Chinese text datasets and need to rigorously clean and assess their quality before using them for training large language models.
Not ideal if you primarily need to clean non-Chinese text, or if you require advanced, domain-specific content moderation beyond general sensitive-content and advertisement filtering.
Stars: 76
Forks: 7
Language: Java
License: —
Category:
Last pushed: Jul 25, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/jiangnanboy/llm_corpus_quality"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
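For callers who prefer Python over curl, the anonymous request above can be sketched with only the standard library. Note this is a minimal sketch: the response schema and the mechanism for passing an API key (header vs. query parameter) are not documented here, so only the keyless request is shown.

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def quality_url(owner: str, repo: str) -> str:
    # Build the per-repository endpoint URL, following the path layout
    # of the curl example above.
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    # Anonymous access is limited to 100 requests/day; a free key raises
    # that to 1,000/day (how the key is attached is not documented here,
    # so this sketch performs only the anonymous request).
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(quality_url("jiangnanboy", "llm_corpus_quality"))
```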
Higher-rated alternatives
mikahama/uralicNLP
An NLP library for Uralic languages such as Finnish, Skolt Sami, Moksha and so on. Also...
SkyworkAI/Skywork
Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and...
gia-uh/lingo
A Python library for context engineering.
shamspias/lexsublm-lite
A laptop‑friendly toolkit for context‑aware single‑word paraphrasing and lexical‑substitution...
AragonerUA/SampoNLP
A corpus-free toolkit for morphological lexicon creation and tokenizer evaluation using...