jiangnanboy/llm_corpus_quality
大模型预训练中文语料清洗及质量评估 (Chinese corpus cleaning and quality assessment for large language model pre-training)
This project helps AI researchers and data scientists prepare high-quality Chinese text data for training large language models. It takes raw, potentially messy Chinese text corpora as input and outputs a cleaned, de-duplicated, and quality-assessed dataset, free from sensitive or advertising content. This ensures the foundational data used for powerful AI models is reliable and safe.
No commits in the last 6 months.
Use this if you are an AI researcher or data scientist working with large Chinese text datasets and need to rigorously clean and assess their quality before using them for training large language models.
Not ideal if you primarily need to clean non-Chinese text, or if you require advanced, domain-specific content moderation beyond general sensitive-content and advertisement filtering.
Stars: 76
Forks: 7
Language: Java
License: —
Category:
Last pushed: Jul 25, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/jiangnanboy/llm_corpus_quality"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
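For callers who prefer Python over curl, the anonymous request above can be sketched with only the standard library. Note this is a minimal sketch: the response schema and the mechanism for passing an API key (header vs. query parameter) are not documented here, so only the keyless request is shown.

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def quality_url(owner: str, repo: str) -> str:
    # Build the per-repository endpoint URL, following the path layout
    # of the curl example above.
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    # Anonymous access is limited to 100 requests/day; a free key raises
    # that to 1,000/day (how the key is attached is not documented here,
    # so this sketch performs only the anonymous request).
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(quality_url("jiangnanboy", "llm_corpus_quality"))
```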
Higher-rated alternatives
mikahama/uralicNLP
An NLP library for Uralic languages such as Finnish, Skolt Sami, Moksha and so on. Also...
SkyworkAI/Skywork
Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and...
gia-uh/lingo
A Python library for context engineering.
shamspias/lexsublm-lite
A laptop‑friendly toolkit for context‑aware single‑word paraphrasing and lexical‑substitution...
AragonerUA/SampoNLP
A corpus-free toolkit for morphological lexicon creation and tokenizer evaluation using...