CatchTheTornado/text-extract-api
Document (PDF, Word, PPTX ...) extraction and parsing API using state-of-the-art OCR engines and Ollama-supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown.
This tool helps you convert scanned documents, images, PDFs, and Office files into editable text (Markdown) or structured data (JSON) with high accuracy. It's especially good at capturing tables, numbers, and even math formulas. You can also use it to automatically remove sensitive personal information from documents. This is ideal for professionals like data analysts, compliance officers, or researchers who need to extract and organize information from a variety of document types.
2,989 stars.
Use this if you need to reliably extract content from diverse document formats into a structured, editable form or automatically redact PII, while keeping all data processing local and private.
Not ideal if you're looking for a simple, cloud-based document conversion service without needing advanced features like PII removal or local AI model integration.
Stars: 2,989
Forks: 252
Language: Python
License: MIT
Category:
Last pushed: Dec 08, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/CatchTheTornado/text-extract-api"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
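The same request can be made from Python. This is a minimal sketch using only the standard library; it assumes the URL follows a category/owner/repo pattern (inferred from the single curl example above) and that the response body is JSON, neither of which is confirmed by this page. The names `build_quality_url` and `fetch_quality` are hypothetical helpers, not part of any published client.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL.

    Assumption: the path is <category>/<owner>/<repo>, generalized from
    the one example URL shown on this page.
    """
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Send a GET request and decode the body.

    Assumption: the endpoint returns JSON; the response schema is not
    documented here, so the decoded value is returned as-is.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("llm-tools", "CatchTheTornado", "text-extract-api")` would request the same URL as the curl command. With the keyless limit of 100 requests/day, caching the result locally is likely worthwhile for anything run repeatedly.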
Related tools
NanoNets/docstrange
Extract and convert data from any document, image, PDF, Word doc, PPT, or URL into multiple...
th1nhhdk/local_ai_ocr
A local, offline (after initial setup), portable OCR tool that can process images and PDF...
Dicklesworthstone/llm_aided_ocr
Enhances Tesseract OCR output using LLMs (local or API) for error correction, smart chunking,...
emcf/thepipe
Get clean data from tricky documents, powered by vision-language models ⚡
langstruct-ai/langstruct
Extract structured data from any content using LLMs.