GermanT5/wikipedia2corpus
Wikipedia text corpus for self-supervised NLP model training
This project helps machine learning engineers and NLP researchers by providing cleaned, sentence-segmented text from Wikipedia. It converts raw Wikipedia database dumps into a ready-to-use text corpus with one sentence per line, a format well suited to training custom natural language processing models.
No commits in the last 6 months.
Use this if you need a large, pre-processed text dataset from Wikipedia in English or German for training or fine-tuning NLP models.
Not ideal if you require text from niche domains outside of general encyclopedic knowledge.
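Because the corpus is simply one sentence per line, it drops straight into most training pipelines. Below is a minimal loading sketch; the filename enwiki.txt and the iter_sentences helper are illustrative placeholders, not part of this repository.

from pathlib import Path

# One sentence per line, as produced by wikipedia2corpus.
# "enwiki.txt" is a placeholder filename, not a file the repo guarantees.
corpus_path = Path("enwiki.txt")

def iter_sentences(path):
    # Yield one sentence per line, skipping any blank lines.
    with path.open(encoding="utf-8") as f:
        for line in f:
            sentence = line.strip()
            if sentence:
                yield sentence

# Example: count sentences without loading the whole corpus into memory.
total = sum(1 for _ in iter_sentences(corpus_path))
print(f"{total} sentences")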
Stars: 46
Forks: 7
Language: Python
License: MIT
Category: NLP
Last pushed: Jul 17, 2022
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/GermanT5/wikipedia2corpus"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
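For scripted access, here is a minimal Python sketch around the same endpoint. It assumes the response is JSON; the listing does not document the payload shape, so treat the parsing step as an assumption.

import requests

# Endpoint from the curl example above; no key needed for 100 requests/day.
URL = "https://pt-edge.onrender.com/api/v1/quality/nlp/GermanT5/wikipedia2corpus"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()

# Assumption: the API returns JSON; inspect the payload before relying on specific fields.
data = resp.json()
print(data)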
Higher-rated alternatives
DerwenAI/pytextrank
Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
Tiiiger/bert_score
BERT score for text generation
BrikerMan/Kashgari
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for...
asyml/texar
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. ...
yohasebe/wp2txt
A command-line tool to extract plain text from Wikipedia dumps with category and section filtering