oscar-project/ungoliant
🕷️ The pipeline for the OSCAR corpus
This tool helps researchers and natural language processing (NLP) practitioners build large, high-quality text datasets from CommonCrawl. It processes raw web data into a clean, language-identified corpus suitable for training language models or conducting linguistic analysis. It is aimed at anyone who needs to build custom, large-scale text corpora.
Use this if you need to build your own massive, cleaned text corpus from CommonCrawl web archives for NLP research or application development.
Not ideal if you're looking for an already-prepared, ready-to-use dataset and don't want to manage the data generation pipeline yourself.
Stars: 176
Forks: 17
Language: Rust
License: Apache-2.0
Category:
Last pushed: Nov 09, 2025
Monthly downloads: 6
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/oscar-project/ungoliant"
Open to everyone at 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
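The same request can be made from a script. The following is a minimal sketch in Python using only the standard library. The helper names `quality_url` and `fetch_quality` are hypothetical, and the path layout (`/quality/<category>/<owner>/<repo>`) is an assumption inferred from the single example URL above. The response format is not documented here, so the sketch returns the body as raw text rather than parsing it.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL.

    Assumption: the path is <category>/<owner>/<repo>, inferred from the
    one example URL; other combinations may not exist.
    """
    parts = (category, owner, repo)
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the endpoint and return the response body as text.

    The body is returned undecoded beyond UTF-8 because the response
    schema is not specified in this page.
    """
    with urlopen(quality_url(category, owner, repo), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch_quality("nlp", "oscar-project", "ungoliant")` issues the same request as the curl command. With the keyless limit of 100 requests/day, it is worth caching the result locally instead of re-fetching on every run.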