superdoc-dev/docx-corpus
The largest open corpus of classified docx documents
This project provides the largest open collection of classified Word documents (.docx files) sourced from the public web. Its pipeline takes raw .docx files, extracts their text, detects the language, and classifies each document into 10 document types and 9 topics, covering more than 46 languages. It is aimed at researchers and developers working on document AI, natural language processing, or information retrieval projects that deal with real-world Word documents.
Use this if you need a comprehensive, pre-classified dataset of actual .docx files to train and evaluate document AI models, perform text analysis on structured documents, or develop applications that understand Word document content.
Not ideal if your primary focus is on scanned images, PDFs, or other document formats, as this project is exclusively focused on native .docx files.
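As a rough illustration of how the per-document labels described above (language, document type, topic) might be used, here is a minimal TypeScript sketch. The `CorpusRecord` shape, its field names, and the label values are all hypothetical and are not taken from the corpus's actual schema.

```typescript
// Hypothetical record shape: field names are assumptions, not the corpus schema.
interface CorpusRecord {
  id: string;
  language: string;     // assumed to be a language code such as "en"
  documentType: string; // assumed label for one of the 10 document types
  topic: string;        // assumed label for one of the 9 topics
  text: string;         // extracted plain text
}

// Keep only the records matching a given language and document type.
function selectRecords(
  records: CorpusRecord[],
  language: string,
  documentType: string,
): CorpusRecord[] {
  return records.filter(
    (r) => r.language === language && r.documentType === documentType,
  );
}

// Count records per topic, e.g. to check class balance before training.
function countByTopic(records: CorpusRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of records) {
    counts.set(r.topic, (counts.get(r.topic) ?? 0) + 1);
  }
  return counts;
}
```

Filtering first and counting afterwards keeps each helper small; the placeholder labels would need to be replaced with whatever the corpus metadata really uses.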
Stars: 45
Forks: 1
Language: TypeScript
License: MIT
Category:
Last pushed: Mar 12, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/superdoc-dev/docx-corpus"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
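The same request can be sketched in TypeScript. Only the base URL and path come from the curl example above; the helper names `buildQualityUrl` and `fetchQuality` are hypothetical, and the assumption that the endpoint returns JSON is not confirmed here, so the result is typed as `unknown`.

```typescript
// Base URL taken from the curl example; everything else is illustrative.
const API_BASE = "https://pt-edge.onrender.com/api/v1/quality";

// Hypothetical helper: builds the endpoint URL from category, owner, and repo.
function buildQualityUrl(category: string, owner: string, repo: string): string {
  const path = [category, owner, repo]
    .map((part) => encodeURIComponent(part))
    .join("/");
  return `${API_BASE}/${path}`;
}

// Hypothetical helper: performs the GET request and parses the body as JSON.
// Assumes a JSON response; the schema is not documented in this listing.
async function fetchQuality(
  category: string,
  owner: string,
  repo: string,
): Promise<unknown> {
  const response = await fetch(buildQualityUrl(category, owner, repo));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```

Calling `fetchQuality("nlp", "superdoc-dev", "docx-corpus")` would request the same URL as the curl command. With the keyless limit of 100 requests per day, caching the result locally is a reasonable precaution.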
Higher-rated alternatives
DerwenAI/pytextrank
Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
Tiiiger/bert_score
BERT score for text generation
BrikerMan/Kashgari
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for...
asyml/texar
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. ...
yohasebe/wp2txt
A command-line tool to extract plain text from Wikipedia dumps with category and section filtering