jiangnanboy/llm_corpus_quality

Chinese corpus cleaning and quality assessment for large-model pre-training

28 / 100 (Experimental)

This project helps AI researchers and data scientists prepare high-quality Chinese text data for training large language models. It takes raw, potentially messy Chinese text corpora as input and outputs a cleaned, de-duplicated, and quality-assessed dataset with sensitive and advertising content filtered out. The goal is foundational training data that is more reliable and safer to use.
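The stages named above (cleaning, de-duplication, sensitive/advertising filtering, quality assessment) can be illustrated with a minimal Java sketch. This is not the project's implementation: the class name `CorpusCleaningSketch`, the two-entry blocklist, and the Han-character-ratio quality score are all assumptions chosen only to make each stage concrete.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not code from llm_corpus_quality.
public class CorpusCleaningSketch {

    // Assumed toy blocklist of advertising phrases; a real pipeline
    // would load much larger sensitive-word and ad lexicons.
    static final List<String> BLOCKED_TERMS = List.of("加微信", "点击购买");

    // Collapse runs of whitespace (including full-width spaces) and trim.
    static String normalize(String text) {
        return text.replaceAll("[\\s\\u3000]+", " ").trim();
    }

    // Fraction of code points that are Han characters: a crude quality signal.
    static double hanRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        long total = text.codePoints().count();
        long han = text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
        return (double) han / total;
    }

    static boolean containsBlockedTerm(String text) {
        return BLOCKED_TERMS.stream().anyMatch(text::contains);
    }

    // Normalize, filter, and de-duplicate while preserving input order.
    static List<String> clean(List<String> docs, double minHanRatio) {
        Set<String> kept = new LinkedHashSet<>();
        for (String raw : docs) {
            String doc = normalize(raw);
            if (doc.isEmpty()) continue;                 // drop empty documents
            if (containsBlockedTerm(doc)) continue;      // drop ad/sensitive text
            if (hanRatio(doc) < minHanRatio) continue;   // drop low-quality text
            kept.add(doc);                               // set drops exact duplicates
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "今天天气很好。",
                "今天天气很好。  ",   // duplicate after normalization
                "加微信领取优惠",      // contains a blocked term
                "lorem ipsum dolor"); // no Han characters
        System.out.println(clean(docs, 0.5));
    }
}
```

Exact-match de-duplication is the simplest possible choice here; near-duplicate detection (for example SimHash or MinHash) would be a natural replacement for large corpora.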

No commits in the last 6 months.

Use this if you are an AI researcher or data scientist working with large Chinese text datasets and need to rigorously clean and assess their quality before using them for training large language models.

Not ideal if your primary need is for non-Chinese text data cleaning, or if you require advanced, domain-specific content moderation beyond general sensitive and advertising text.

Large Language Model Training, Natural Language Processing, Data Quality Assurance, Chinese Text Data, AI Data Preparation
No License, Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 9 / 25
Maturity 8 / 25
Community 11 / 25

Stars: 76
Forks: 7
Language: Java
License: None
Last pushed: Jul 25, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/jiangnanboy/llm_corpus_quality"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
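The same request can be made from Java with the standard `java.net.http` client. The sketch below is a rough equivalent of the curl command. The class name `QualityApiSketch` and the reading of the path as category/owner/repository are assumptions, and since the response format is not described here, the body is printed as raw text.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper; only the URL itself comes from the listing above.
public class QualityApiSketch {

    static final String BASE = "https://pt-edge.onrender.com/api/v1/quality";

    // Assumption: the path segments are category, owner, and repository name.
    static URI buildUri(String category, String owner, String repo) {
        return URI.create(BASE + "/" + category + "/" + owner + "/" + repo);
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(buildUri("nlp", "jiangnanboy", "llm_corpus_quality"))
                .GET()
                .build();

        // No API key is sent, so the keyless daily limit applies.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode());
        System.out.println(response.body()); // format not documented here
    }
}
```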