deanmalmgren/textract

extract text from any document. no muss. no fuss.

Score: 68 / 100 (Established)

This tool helps you quickly get the raw text content out of various digital documents, no matter their original format. You provide a document file (like a PDF, Word document, or image) and it gives you back just the text, ready for use. This is perfect for data analysts, researchers, or anyone needing to extract text for analysis or archiving.
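The file-in, text-out flow described above can be sketched in Python. This is a minimal sketch, not the project's documented recipe: it assumes the `textract` package is installed from PyPI and uses its `textract.process` call, which returns bytes. The `extract_text` helper and its `extractor` parameter are hypothetical names introduced here for illustration.

```python
from pathlib import Path


def extract_text(path, extractor=None):
    """Return the plain text of a document as a str.

    `extractor` is a hypothetical hook: any callable that takes a file
    path (str) and returns bytes. When omitted, textract.process is used.
    """
    if extractor is None:
        import textract  # third-party, assumed installed: pip install textract
        extractor = textract.process
    raw = extractor(str(path))  # textract.process returns bytes
    # Decode to str; undecodable bytes are replaced rather than raising.
    return raw.decode("utf-8", errors="replace").strip()
```

Calling `extract_text(Path("report.pdf"))` (a placeholder file name) would hand the file to textract, which picks a parser from the file extension. Some formats may need extra system tools to be installed before extraction works.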

4,482 stars. Used by 1 other package. Available on PyPI.

Use this if you need to reliably pull text out of many different types of documents without manual copying and pasting.

Not ideal if you need to extract specific data fields or maintain the original document's formatting and layout.

Tags: document-processing, data-extraction, text-mining, content-analysis, digital-archiving

Maintenance: 10 / 25
Adoption: 11 / 25
Maturity: 25 / 25
Community: 22 / 25

Stars: 4,482
Forks: 665
Language: HTML
License: MIT
Last pushed: Feb 04, 2026
Commits (30d): 0
Dependencies: 10
Reverse dependents: 1

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/deanmalmgren/textract"

Open to everyone: 100 requests/day with no key. A free key raises the limit to 1,000 requests/day.
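The same request as the curl command above can be sketched in Python with only the standard library. Two things here are assumptions rather than documented behavior: that the endpoint returns JSON, and that the URL follows a category/owner/repo pattern (only the single URL shown above is confirmed). The function names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(owner, repo, category="nlp"):
    """Build the endpoint URL (path pattern inferred from the one example)."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(owner, repo, category="nlp", timeout=10):
    """Fetch the quality record; assumes the response body is JSON."""
    url = quality_url(owner, repo, category)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

`fetch_quality("deanmalmgren", "textract")` would issue the request shown in the curl example. Since unauthenticated use is limited to 100 requests/day, caching the result locally is a reasonable choice for repeated lookups.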