bigscience-workshop/data-preparation

Code used for sourcing and cleaning the BigScience ROOTS corpus

Score: 44 / 100 (Emerging)

This project helps create very large, clean text datasets for training advanced AI language models. It takes raw, unstructured text from various web sources, cleans it of noise, filters for quality, and removes duplicate content. The output is a massive, high-quality text corpus ready for use in machine learning. This is for researchers and engineers building large-scale language models.
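One of the steps described above is duplicate removal. As an illustrative sketch only (not the repository's actual pipeline, which operates at multi-terabyte scale), exact-duplicate filtering can be done by hashing a normalized form of each document and keeping the first occurrence:

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Keep the first occurrence of each normalized document."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Hello   World", "hello world", "Different text"]
print(deduplicate(docs))  # → ['Hello   World', 'Different text']
```

Production pipelines typically go further with near-duplicate detection (e.g. MinHash-based similarity), but the hash-and-keep-first pattern above captures the core idea.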

319 stars. No commits in the last 6 months.

Use this if you need to systematically source, clean, and prepare multi-terabyte scale text data for training language models.

Not ideal if you're working with small-scale datasets or need a general-purpose text cleaning tool for non-AI tasks.

Tags: AI-data-preparation, large-language-models, text-corpus-creation, machine-learning-engineering, natural-language-processing-research
Status: Stale (6m) · No Package · No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 18 / 25


Stars: 319
Forks: 42
Language: Jupyter Notebook
License: Apache-2.0
Last pushed: Mar 20, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/bigscience-workshop/data-preparation"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.