superdoc-dev/docx-corpus

The largest open corpus of classified docx documents

34 / 100 (Emerging)

This project provides the largest open collection of classified Word documents (.docx files) sourced from the public web. It processes raw .docx files to extract text, detect language, and classify them into 10 document types and 9 topics, covering more than 46 languages. It is aimed at researchers and developers working on document AI, natural language processing, or information retrieval over real-world Word documents.
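
As a rough illustration of the text-extraction step described above, here is a minimal sketch in Python using only the standard library. It is not the project's own pipeline (the repository is written in TypeScript and its implementation is not shown here). It relies only on the fact that a .docx file is a ZIP package whose main body lives in `word/document.xml`. The function names `extract_docx_text` and `make_minimal_docx` are hypothetical, chosen for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by word/document.xml in a .docx package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(source) -> str:
    """Return the body text of a .docx file, one line per paragraph.

    `source` is a file path or a binary file-like object.
    Hypothetical helper: a sketch, not the corpus pipeline itself.
    """
    with zipfile.ZipFile(source) as package:
        xml_bytes = package.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text sits in <w:t> runs.
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_minimal_docx(paragraphs) -> bytes:
    """Build a tiny in-memory package containing only word/document.xml.

    For demonstration only: a real .docx also carries content types,
    relationships, styles, and other parts.
    """
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = make_minimal_docx(["First paragraph.", "Second paragraph."])
    print(extract_docx_text(io.BytesIO(sample)))
```

This sketch reads only the main document body. Headers, footers, footnotes, and comments live in separate parts of the package, and paragraphs nested inside text boxes may be emitted more than once, so a production extractor would need more care.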

Use this if you need a comprehensive, pre-classified dataset of actual .docx files to train and evaluate document AI models, perform text analysis on structured documents, or develop applications that understand Word document content.

Not ideal if your primary focus is on scanned images, PDFs, or other document formats, as this project is exclusively focused on native .docx files.

document-intelligence natural-language-processing information-retrieval data-science text-analytics
No Package No Dependents
Maintenance 10 / 25
Adoption 8 / 25
Maturity 13 / 25
Community 3 / 25

Stars 45
Forks 1
Language TypeScript
License MIT
Last pushed Mar 12, 2026
Commits (30d) 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/superdoc-dev/docx-corpus"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
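
The same request can be made from a script. The sketch below uses Python's standard library and rests on two assumptions that the listing does not confirm: that the endpoint path follows a `category/owner/repo` pattern (inferred from the single curl example above) and that the response body is JSON. The names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; the category/owner/repo layout
# is an assumption inferred from that one URL.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the request URL, percent-encoding each path segment."""
    segments = (category, owner, repo)
    return BASE_URL + "/" + "/".join(quote(part, safe="") for part in segments)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record and parse it as JSON.

    Assumes a JSON response; the schema is not documented in this listing,
    so the parsed value is returned as-is.
    """
    request = urllib.request.Request(
        quality_url(category, owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Performs a real network call and counts against the daily request limit.
    print(fetch_quality("nlp", "superdoc-dev", "docx-corpus"))
```

With no key, calls are limited to 100 per day, so a script that polls many repositories should cache responses rather than refetch them.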