NanoNets/docext
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)
This toolkit helps businesses convert documents such as invoices, passports, PDFs, and images into structured markdown, or extract specific information from them. It turns unstructured documents into organized data that is ready for analysis or integration, without needing an internet connection. It suits operations managers, compliance officers, and data entry teams that handle large volumes of documents.
1,871 stars. No commits in the last 6 months. Available on PyPI.
Use this if you need to extract specific details from documents, convert them into a structured markdown format, or benchmark the performance of document processing AI models, all while keeping your data on your own servers.
Not ideal if you need a cloud-based solution or if your primary need is simple text recognition without complex semantic understanding or structured data extraction.
Stars: 1,871
Forks: 135
Language: Python
License: Apache-2.0
Category: (not listed)
Last pushed: Aug 25, 2025
Commits (30d): 0
Dependencies: 20
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/NanoNets/docext"
Open to everyone: 100 requests/day with no key needed, or 1,000 requests/day with a free key.
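The same request can be made from Python. The following is a minimal sketch using only the standard library. The base URL is taken from the curl command above. The helper names `quality_url` and `fetch_quality` are hypothetical, and the code assumes the endpoint returns a JSON body, which this page does not confirm.

```python
import json
import urllib.request
from urllib.parse import quote

# Base URL copied from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/rag"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (hypothetical helper)."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """GET the endpoint and decode the body.

    Assumption: the response is JSON. The page does not document the
    response format, so json.loads may raise if that is not the case.
    """
    request = urllib.request.Request(
        quality_url(owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("NanoNets", "docext")` would issue the same unauthenticated request as the curl command, so it counts against the 100 requests/day limit. How a key is passed for the higher limit is not stated here, so the sketch leaves it out.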
Related tools
kreuzberg-dev/kreuzberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and...
PaddlePaddle/PaddleOCR
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR...
yfedoseev/pdf_oxide
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown...
opendataloader-project/opendataloader-pdf
PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
AKSarav/pdfstract
PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline - Available as CLI - WEBUI - API