GermanT5/wikipedia2corpus

Wikipedia text corpus for self-supervised NLP model training

Score: 38 / 100 (Emerging)

This project helps machine learning engineers and NLP researchers by providing cleaned, sentence-segmented text from Wikipedia. It takes raw Wikipedia database dumps and transforms them into a ready-to-use text corpus where each line is a single sentence. This output is ideal for training custom natural language processing models.
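Because the output format is simply one sentence per line, consuming it needs no special tooling. Below is a minimal sketch of streaming sentences from a corpus file for downstream training; the file name is a placeholder, not a path taken from this project.

# Stream sentences from a corpus file produced by the pipeline.
# The format is one sentence per line, so plain line iteration suffices.
def iter_sentences(path):
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            sentence = line.strip()
            if sentence:  # skip blank separator lines, if any
                yield sentence

# Example: count sentences without loading the whole file into memory.
corpus_path = "dewiki-clean.txt"  # placeholder file name
total = sum(1 for _ in iter_sentences(corpus_path))
print(f"{total} sentences")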

No commits in the last 6 months.

Use this if you need a large, pre-processed text dataset from Wikipedia in English or German for training or fine-tuning NLP models.

Not ideal if you require text from specific, niche domains outside of general encyclopedic knowledge.

natural-language-processing machine-learning-engineering text-corpus-creation data-preparation model-training-data
Status: Stale (6 months), no package published, no dependents
Score breakdown:
Maintenance: 0 / 25
Adoption: 8 / 25
Maturity: 16 / 25
Community: 14 / 25

Stars: 46
Forks: 7
Language: Python
License: MIT
Last pushed: Jul 17, 2022
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/GermanT5/wikipedia2corpus"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
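
The same call from Python, for anyone scripting against the endpoint; the requests library is assumed to be installed, and the shape of the JSON response is not documented here, so the sketch just prints the raw payload.

import requests

# Same endpoint as the curl example above; no API key is needed
# within the anonymous 100 requests/day limit.
url = "https://pt-edge.onrender.com/api/v1/quality/nlp/GermanT5/wikipedia2corpus"

response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors

# The response schema is an assumption, so print it as-is.
print(response.json())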