bigscience-workshop/data-preparation

Code used for sourcing and cleaning the BigScience ROOTS corpus

Score: 44 / 100 (Emerging)

This project helps create very large, clean text datasets for training advanced AI language models. It takes raw, unstructured text from various web sources, cleans it of noise, filters for quality, and removes duplicate content. The output is a massive, high-quality text corpus ready for use in machine learning. This is for researchers and engineers building large-scale language models.
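One of the steps described above is duplicate removal. As an illustrative sketch only (not the repository's actual pipeline, which operates at multi-terabyte scale), exact-duplicate filtering can be done by hashing a normalized form of each document and keeping the first occurrence:

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Keep the first occurrence of each normalized document."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Hello   World", "hello world", "Different text"]
print(deduplicate(docs))  # → ['Hello   World', 'Different text']
```

Production pipelines typically go further with near-duplicate detection (e.g. MinHash-based similarity), but the hash-and-keep-first pattern above captures the core idea.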

319 stars. No commits in the last 6 months.

Use this if you need to systematically source, clean, and prepare multi-terabyte scale text data for training language models.

Not ideal if you're working with small-scale datasets or need a general-purpose text cleaning tool for non-AI tasks.

Tags: AI-data-preparation, large-language-models, text-corpus-creation, machine-learning-engineering, natural-language-processing-research
Status: Stale (6m) · No Package · No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 18 / 25


Stars: 319
Forks: 42
Language: Jupyter Notebook
License: Apache-2.0
Last pushed: Mar 20, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/bigscience-workshop/data-preparation"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.