oscar-project/ungoliant

:spider: The pipeline for the OSCAR corpus

48 / 100

Emerging

This tool helps researchers and natural language processing (NLP) practitioners build large, high-quality text datasets from CommonCrawl. It processes raw web data into a clean, language-identified corpus suitable for training language models or for linguistic analysis. It is aimed at researchers and engineers who need custom, large-scale text corpora.

176 stars and 6 monthly downloads.

Use this if you need to build your own massive, cleaned text corpus from CommonCrawl web archives for NLP research or application development.

Not ideal if you're looking for an already-prepared, ready-to-use dataset and don't want to manage the data generation pipeline yourself.

Natural Language Processing, Corpus Linguistics, Text Data Collection, Language Model Training, Web Data Processing

No Package, No Dependents

Maintenance 6 / 25

Adoption 12 / 25

Maturity 16 / 25

Community 14 / 25

Stars

176

Forks

Language

Rust

License

Apache-2.0

Higher-rated alternatives

ChenghaoMou/text-dedup

All-in-one text de-duplication

loretoparisi/fasttext.js

FastText for Node.js

messense/fasttext-serving

fastText model serving service

gagan3012/PolyDeDupe

PolyDeDupe: Multi-Lingual Data Deduplication

vrasneur/pyfasttext

Yet another Python binding for fastText
