bigscience-workshop/data-preparation
Code used for sourcing and cleaning the BigScience ROOTS corpus
This project builds very large, clean text datasets for training language models: it sources raw text from web crawls and other collections, strips noise, filters for quality, and deduplicates the result. The output is the ROOTS corpus, the roughly 1.6 TB multilingual dataset used to train BigScience's BLOOM model. It is aimed at researchers and engineers preparing data for large-scale language models.
319 stars. No commits in the last 6 months.
Use this if you need to systematically source, clean, and prepare text data at multi-terabyte scale for training language models (a toy sketch of the core filtering and deduplication steps follows below).
Not ideal if you're working with small-scale datasets or need a general-purpose text cleaning tool for non-AI tasks.
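The repository's actual pipeline lives in its notebooks and scripts; as a rough illustration of the filter-then-deduplicate pattern it applies, here is a minimal, self-contained Python sketch. The heuristics, thresholds, and function names are invented for the example, not taken from the repo.

import hashlib

def looks_clean(text: str, min_words: int = 20, min_alpha_ratio: float = 0.7) -> bool:
    """Heuristic quality filter (illustrative): drop very short documents and
    documents dominated by non-alphabetic noise such as markup or tables."""
    words = text.split()
    if len(words) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / max(len(text), 1) >= min_alpha_ratio

def dedup_exact(docs):
    """Exact deduplication via content hashing: keep the first copy of each
    distinct document, skip later repeats."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

raw_docs = [
    "A long enough example paragraph of clean running text " * 5,
    "<td>1</td><td>2</td>" * 30,  # mostly markup noise -> filtered out
    "A long enough example paragraph of clean running text " * 5,  # exact duplicate -> dropped
]
corpus = list(dedup_exact(d for d in raw_docs if looks_clean(d)))
print(len(corpus))  # 1

At corpus scale, exact hashing like this is typically complemented by fuzzy methods (e.g. MinHash) so near-duplicates are caught as well.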
Stars: 319
Forks: 42
Language: Jupyter Notebook
License: Apache-2.0
Category: NLP
Last pushed: Mar 20, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/bigscience-workshop/data-preparation"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
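To consume the same endpoint from code instead of curl, a minimal stdlib-only Python sketch follows. The response schema is not documented here, so the example assumes only that the endpoint returns JSON and prints whatever comes back.

import json
import urllib.request

# Same endpoint as the curl example above; no key needed under the 100/day tier.
URL = ("https://pt-edge.onrender.com/api/v1/quality/nlp/"
       "bigscience-workshop/data-preparation")

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

# Pretty-print the raw payload; inspect the field names before relying on them.
print(json.dumps(data, indent=2))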
Higher-rated alternatives
PaddlePaddle/ERNIE
The official repository for ERNIE 4.5 and ERNIEKit – its industrial-grade development toolkit...
eyurtsev/kor
LLM(😽)
NiuTrans/NLPBook
A comprehensive book on neural networks and large language models in NLP
allenai/TOPICAL
TOPICAL: TOPIC pages AutomagicaLly
ditto-assistant/nlp_server
NLP server housing intent and NER models as well as an LLM agent with long term memory vector...