oscar-project/ungoliant

🕷️ The pipeline for the OSCAR corpus

Quality score: 48 / 100 (Emerging)

This tool helps researchers and natural language processing (NLP) practitioners create large, high-quality text datasets from CommonCrawl. It takes raw web data and processes it into a clean, language-identified corpus suitable for training language models or conducting linguistic analysis. The primary users are researchers or engineers who need to build custom, large-scale text corpora.

176 stars and 6 monthly downloads.

Use this if you need to build your own massive, cleaned text corpus from CommonCrawl web archives for NLP research or application development.

Not ideal if you're looking for an already-prepared, ready-to-use dataset and don't want to manage the data generation pipeline yourself.

Natural Language Processing, Corpus Linguistics, Text Data Collection, Language Model Training, Web Data Processing
No package, no dependents
Maintenance: 6 / 25
Adoption: 12 / 25
Maturity: 16 / 25
Community: 14 / 25

Stars: 176
Forks: 17
Language: Rust
License: Apache-2.0
Last pushed: Nov 09, 2025
Monthly downloads: 6
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/oscar-project/ungoliant"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
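The same request can be made from a script. The following is a minimal sketch using only the Python standard library. The helper names `quality_url` and `fetch_quality` are hypothetical, the `category/owner/repo` path layout is inferred from the single URL shown above, and the response is assumed to be JSON, which this page does not confirm.

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL (hypothetical helper).

    The category/owner/repo layout is an assumption inferred from the
    one example URL on this page.
    """
    return f"{API_BASE}/{category}/{repo}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0):
    """Fetch the quality data for one repository (hypothetical helper).

    Assumes the response body is JSON; adjust the decoding if it is not.
    """
    url = quality_url(category, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "oscar-project/ungoliant")` issues the same GET request as the curl command, so each call counts against the 100 requests/day keyless limit.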