NanoNets/docext

An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

Established

This toolkit converts documents such as invoices and passports, supplied as PDFs or images, into structured markdown, or extracts specific fields from them. It takes unstructured documents and produces organized data ready for analysis or integration, and it runs on-premises without needing an internet connection. It suits operations managers, compliance officers, and data entry teams who handle large volumes of documents.

1,871 stars. No commits in the last 6 months. Available on PyPI.

Use this if you need to extract specific details from documents, convert them into a structured markdown format, or benchmark the performance of document processing AI models, all while keeping your data on your own servers.

Not ideal if you need a cloud-based solution or if your primary need is simple text recognition without complex semantic understanding or structured data extraction.

document-processing data-extraction compliance operations-management information-management

Maintenance 2 / 25

Adoption 10 / 25

Maturity 25 / 25

Community 19 / 25
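The page does not state how the overall score is derived from these four categories. A minimal sketch, assuming (this is not confirmed by the source) that the four 25-point categories are simply added to give a total out of 100:

```python
# Sub-scores exactly as listed on the page; each category is out of 25.
sub_scores = {
    "Maintenance": 2,
    "Adoption": 10,
    "Maturity": 25,
    "Community": 19,
}
MAX_PER_CATEGORY = 25

# ASSUMPTION: the overall score is a plain sum of the categories.
# The page itself does not document the formula.
total = sum(sub_scores.values())
maximum = MAX_PER_CATEGORY * len(sub_scores)

print(f"{total} / {maximum}")  # prints "56 / 100"
```

Under that assumption, the low Maintenance score (no commits in the last 6 months) accounts for most of the gap from a full score.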

Stars: 1,871

Forks: 135

Language: Python

License: Apache-2.0

Related tools

kreuzberg-dev/kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, and...

PaddlePaddle/PaddleOCR

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR...

yfedoseev/pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown...

opendataloader-project/opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

AKSarav/pdfstract

PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline - Available as CLI - WEBUI - API
