GermanT5/wikipedia2corpus
Wikipedia text corpus for self-supervised NLP model training
This project helps machine learning engineers and NLP researchers by providing cleaned, sentence-segmented text from Wikipedia. It converts raw Wikipedia database dumps into a ready-to-use text corpus with one sentence per line, a format well suited to training custom natural language processing models.
No commits in the last 6 months.
Use this if you need a large, pre-processed text dataset from Wikipedia in English or German for training or fine-tuning NLP models.
Not ideal if you require text from niche domains outside of general encyclopedic knowledge.
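Because the corpus is simply one sentence per line, it drops straight into most training pipelines. Below is a minimal loading sketch; the filename enwiki.txt and the iter_sentences helper are illustrative placeholders, not part of this repository.

from pathlib import Path

# One sentence per line, as produced by wikipedia2corpus.
# "enwiki.txt" is a placeholder filename, not a file the repo guarantees.
corpus_path = Path("enwiki.txt")

def iter_sentences(path):
    # Yield one sentence per line, skipping any blank lines.
    with path.open(encoding="utf-8") as f:
        for line in f:
            sentence = line.strip()
            if sentence:
                yield sentence

# Example: count sentences without loading the whole corpus into memory.
total = sum(1 for _ in iter_sentences(corpus_path))
print(f"{total} sentences")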
Stars: 46
Forks: 7
Language: Python
License: MIT
Category: NLP
Last pushed: Jul 17, 2022
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/GermanT5/wikipedia2corpus"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
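For scripted access, here is a minimal Python sketch around the same endpoint. It assumes the response is JSON; the listing does not document the payload shape, so treat the parsing step as an assumption.

import requests

# Endpoint from the curl example above; no key needed for 100 requests/day.
URL = "https://pt-edge.onrender.com/api/v1/quality/nlp/GermanT5/wikipedia2corpus"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()

# Assumption: the API returns JSON; inspect the payload before relying on specific fields.
data = resp.json()
print(data)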
Higher-rated alternatives
DerwenAI/pytextrank
Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
Tiiiger/bert_score
BERT score for text generation
BrikerMan/Kashgari
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for...
asyml/texar
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. ...
yohasebe/wp2txt
A command-line tool to extract plain text from Wikipedia dumps with category and section filtering