deanmalmgren/textract
extract text from any document. no muss. no fuss.
This tool extracts the raw text content from digital documents, regardless of their original format. You provide a document file (such as a PDF, Word document, or image) and it returns just the text, ready for use. It suits data analysts, researchers, and anyone who needs to extract text for analysis or archiving.
4,482 stars. Used by 1 other package. Available on PyPI.
Use this if you need to reliably pull text out of many different types of documents without manual copying and pasting.
Not ideal if you need to extract specific data fields or maintain the original document's formatting and layout.
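The file-in, text-out workflow described above might look like the following in Python. This is a minimal sketch, not the project's recommended usage: `extract_text` and its `extractor` parameter are hypothetical names introduced here, and the whitespace collapsing is an illustrative choice. The only library call assumed is `textract.process`, which takes a file path and returns bytes.

```python
from typing import Callable, Optional


def extract_text(
    path: str,
    extractor: Optional[Callable[[str], bytes]] = None,
    encoding: str = "utf-8",
) -> str:
    """Return the plain text of the document at `path` (hypothetical helper)."""
    if extractor is None:
        # Imported lazily so this module loads even without textract installed.
        import textract  # third-party: pip install textract

        extractor = textract.process
    raw = extractor(str(path))
    # The extractor returns bytes; decode them, then collapse runs of
    # whitespace so the result is a single clean line of text.
    return " ".join(raw.decode(encoding, errors="replace").split())
```

With textract installed, `extract_text("report.pdf")` would call `textract.process` directly. Some formats rely on external system tools, so support for a given file type depends on the local setup. Passing a custom `extractor` makes the helper easy to exercise without any real documents.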
Stars: 4,482
Forks: 665
Language: HTML
License: MIT
Category:
Last pushed: Feb 04, 2026
Commits (30d): 0
Dependencies: 10
Reverse dependents: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/deanmalmgren/textract"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
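The same request can be made from Python with the standard library. This is a sketch under stated assumptions: the helper names `build_quality_url` and `fetch_quality` are hypothetical, the `nlp` path segment is treated as a category, and the response is assumed to be JSON (the page does not document the response format).

```python
import json
import urllib.request

# Base path taken from the curl example; everything after it is
# assumed to be "<category>/<owner>/<repo>".
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL for a repository such as 'deanmalmgren/textract'."""
    return f"{BASE_URL}/{category}/{repo}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record. Assumes the endpoint returns JSON."""
    url = build_quality_url(category, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "deanmalmgren/textract")` issues the same GET request as the curl command, with no API key.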
Related tools
deepdoctection/deepdoctection
A Repo For Document AI
eikek/docspell
Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources...
zzzDavid/ICDAR-2019-SROIE
ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction
clovaai/donut
Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic...
axa-group/Parsr
Transforms PDF, Documents and Images into Enriched Structured Data