superdoc-dev/docx-corpus

The largest open corpus of classified docx documents

Emerging

This project provides the largest open collection of classified Word documents (.docx files) sourced from the public web. It processes raw .docx files to extract text, detect language, and classify them by document type (10 classes) and topic (9 classes), covering more than 46 languages. The resource is aimed at researchers and developers working on document AI, natural language processing, or information retrieval with real-world Word documents.

Use this if you need a comprehensive, pre-classified dataset of actual .docx files to train and evaluate document AI models, perform text analysis on structured documents, or develop applications that understand Word document content.

Not ideal if you work mainly with scanned images, PDFs, or other document formats: this project covers only native .docx files.
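As a rough illustration of how a pre-classified corpus like this might be consumed, the sketch below filters metadata records by language and document type and counts records per type before training. The `DocRecord` shape, its field names, the helper functions, and the sample labels are all hypothetical; none of them are taken from the project's actual schema or API.

```typescript
// Hypothetical shape of one corpus entry. The field names are assumptions
// for illustration, not the project's actual schema.
interface DocRecord {
  id: string;
  language: string; // e.g. a language code such as "en"
  documentType: string; // one of the document types assigned by the classifier
  topic: string; // one of the topics assigned by the classifier
  text: string; // text extracted from the .docx file
}

// Select records in a given language, optionally restricted to one document type.
function selectRecords(
  records: DocRecord[],
  language: string,
  documentType?: string,
): DocRecord[] {
  return records.filter(
    (r) =>
      r.language === language &&
      (documentType === undefined || r.documentType === documentType),
  );
}

// Count records per document type, e.g. to check class balance before training.
function countByType(records: DocRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of records) {
    counts.set(r.documentType, (counts.get(r.documentType) ?? 0) + 1);
  }
  return counts;
}

// Made-up sample data; the labels "report" and "letter" are placeholders.
const sample: DocRecord[] = [
  { id: "a", language: "en", documentType: "report", topic: "finance", text: "Quarterly summary" },
  { id: "b", language: "en", documentType: "letter", topic: "legal", text: "Dear Sir or Madam" },
  { id: "c", language: "de", documentType: "report", topic: "finance", text: "Quartalsbericht" },
];

const englishReports = selectRecords(sample, "en", "report");
const typeCounts = countByType(sample);
console.log(englishReports.length, typeCounts.get("report"));
```

Keeping the filter and the count as separate small functions makes it easy to check class balance on any subset before using it for training or evaluation.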

document-intelligence natural-language-processing information-retrieval data-science text-analytics

No Package No Dependents

Maintenance 10 / 25

Adoption 8 / 25

Maturity 13 / 25

Community 3 / 25

Language: TypeScript

License: MIT

Higher-rated alternatives

DerwenAI/pytextrank

Python implementation of TextRank algorithms ("textgraphs") for phrase extraction

Tiiiger/bert_score

BERT score for text generation

BrikerMan/Kashgari

Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for...

asyml/texar

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. ...

yohasebe/wp2txt

A command-line tool to extract plain text from Wikipedia dumps with category and section filtering
