Document OCR Extraction NLP Tools

Tools for extracting structured and unstructured text from documents (PDFs, scans, receipts, invoices, IDs) using OCR and computer vision. Does NOT include general document analysis, summarization, or retrieval systems without extraction focus.

There are 63 document OCR extraction tools tracked. One scores above 70 (Verified tier). The highest-rated is deepdoctection/deepdoctection at 76/100 with 3,147 stars. Two of the top 10 are actively maintained.

Get all 63 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=document-ocr-extraction&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
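
The same request can be made from Python. The sketch below uses only the standard library and rebuilds the URL from the curl example; the function names `build_url` and `fetch_projects` are hypothetical, and the shape of the JSON response is not documented here, so the decoded body is returned as-is. Note that the example URL passes `limit=20`; whether a larger limit is accepted, or how an API key is supplied, is not stated on this page.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="document-ocr-extraction", limit=20):
    """Build the dataset URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and decode the JSON body (response schema is an unknown here)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Show the URL that would be requested.
print(build_url())

# Uncomment to perform the request (counts against the daily quota):
# data = fetch_projects(build_url())
# print(json.dumps(data, indent=2)[:500])
```

Keeping URL construction separate from the network call makes it easy to inspect the request before spending one of the 100 daily keyless requests.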

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | deepdoctection/deepdoctection | A Repo For Document AI | 76 | Verified |
| 2 | deanmalmgren/textract | extract text from any document. no muss. no fuss. | 68 | Established |
| 3 | eikek/docspell | Assist in organizing your piles of documents, resulting from scanners,... | 61 | Established |
| 4 | zzzDavid/ICDAR-2019-SROIE | ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information... | 51 | Established |
| 5 | clovaai/donut | Official Implementation of OCR-free Document Understanding Transformer... | 45 | Emerging |
| 6 | axa-group/Parsr | Transforms PDF, Documents and Images into Enriched Structured Data | 44 | Emerging |
| 7 | Saransh-cpp/OCRed | Clever, simple, and intuitive wrapper functionalities for OCRing specific... | 44 | Emerging |
| 8 | gnana70/tamil_ocr | OCR Tamil is a powerful tool that can detect and recognize text in Tamil... | 44 | Emerging |
| 9 | JonnoB/reading_the_unreadable | A pipeline for performing OCR on historical newspapers | 43 | Emerging |
| 10 | rithulkamesh/docproc | Document Intelligence Platform - Extract, refine, and query documents with... | 42 | Emerging |
| 11 | NjoyimPeguy/ICDAR-2019-RRC-SROIE | ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information... | 40 | Emerging |
| 12 | s3nh/text-detector | Tool which allow you to detect and translate text. | 39 | Emerging |
| 13 | gani114433/OCR_workflow | N8N OCR workflow | 36 | Emerging |
| 14 | Rushi-Balapure/pdf_2_json_extractor | A high-performance Python library for extracting structured content from PDF... | 36 | Emerging |
| 15 | clovaai/webvicob | Official Implementation of Web-based Visual Corpus Builder (Webvicob), ICDAR 2023 | 36 | Emerging |
| 16 | situx/CuneiPainter | An App to recognize cuneiform characters on your Android phone | 35 | Emerging |
| 17 | lukevanin/OCRAI | Optical Character Recognition Artificial Intelligence iOS app for Udacity nanodegree | 35 | Emerging |
| 18 | louisbrulenaudet/apple-ocr | Easy-to-Use Apple Vision wrapper for text extraction, scalar representation... | 35 | Emerging |
| 19 | trhgquan/OCR_chu_nom | Chữ Nôm OCR course project (CSC15006) | 34 | Emerging |
| 20 | Samuel310/Text-Recognition | Android application to extract text from an image using firebase MLkit. | 33 | Emerging |
| 21 | codebywiam/invoice-ocr | This project extracts key fields (like invoice number, date, total, and... | 32 | Emerging |
| 22 | Shulk97/daniel | This repository contain the implementation of DANIEL. (A fast Document... | 32 | Emerging |
| 23 | ierolsen/Business-Card-Reader-App | The main idea of this project is that extracting entities from the scanned... | 31 | Emerging |
| 24 | jweissenberger/auto-docs | A CLI tool that automatically generates documentation for python code using... | 31 | Emerging |
| 25 | macosnik/Recognize-text-from-image | Telegram bot for recognizing text in images using neural networks | 25 | Experimental |
| 26 | isikmuhamm/unstructured-data-extraction-engine | Automated data ingestion pipeline for extracting plain text from proprietary... | 24 | Experimental |
| 27 | transybao1393/android-ocr | Android OCR using CameraX, support MLKit, support offline mode, support... | 24 | Experimental |
| 28 | meck93/ScanOrUploadMe | A React-Native mobile application that digitalizes physical event... | 23 | Experimental |
| 29 | erl-ang/interactive-ocr | Implementation of a couple of heuristics that estimate OCR quality without... | 23 | Experimental |
| 30 | itshivams/Persona-Driven-Document-Intelligence | Persona-Driven Document Intelligence – A lightweight, CPU-only system that... | 23 | Experimental |
| 31 | fmadore/iwac-ai-pipelines | AI pipelines for Omeka S digital collections - OCR correction, entity... | 23 | Experimental |
| 32 | xuan3986/Texthandle | Open source project provided to Baidu PaddlePaddle community. Apply... | 22 | Experimental |
| 33 | archity/doc-scanner | Computer Vision and NLP based document scanner, text extractor and summarizer. | 21 | Experimental |
| 34 | SivaPA08/text-capture | Captures screen regions, extracts text and copies it to the clipboard | 21 | Experimental |
| 35 | dev-sungman/recent-ocr-papers | this repo include paper review, code in text detection, text recognition,... | 21 | Experimental |
| 36 | esteininger/file-processor | A Python library that uses AI to convert unstructured files (like PDFs,... | 21 | Experimental |
| 37 | avirajsa/DocuMind | DocuMind - Python project for document analysis. Analyze, summarize, and... | 20 | Experimental |
| 38 | Keizouw8/OCR-Command-Line-Tool | A tool that can be used in the CLI or NodeJS environment to scan for text in... | 20 | Experimental |
| 39 | shubh11220/PDF-Text-Extraction | Create a data extraction platform for users to conveniently obtain data in a... | 20 | Experimental |
| 40 | DecisionNerd/docunderstand | A python system for Visually Rich Document Understanding | 19 | Experimental |
| 41 | mishaelaaa/OCR | This is a project in which I store all my attempts to create an application... | 19 | Experimental |
| 42 | Cool-fire/Snipps | 📚 📝📜 A simple android app to convert information into digital snippets,... | 19 | Experimental |
| 43 | SundayOni/document-ocr-nlp-pipeline | End-to-end pipeline for extracting and structuring text from scanned, PDF... | 19 | Experimental |
| 44 | michael-borck/document-lens | Analyzes text documents for readability, academic integrity, and linguistic... | 19 | Experimental |
| 45 | husnutass/ml_kit_app | A Flutter mobile app to read data from business cards and save that data in... | 18 | Experimental |
| 46 | saloni-rangari/nlp-ocr-marathi | This mini-project implements Marathi handwritten text recognition using... | 18 | Experimental |
| 47 | emilyhasson/Text-Recognition | Scripts to convert low-quality scanned PDFs to text files using Google Cloud... | 18 | Experimental |
| 48 | HySonLab/TeBaAb | TeBaAb: Text-Based Antigen-Conditioned Antibody Redesign via Directed Evolution | 18 | Experimental |
| 49 | fdovila/PDF2TXT4NLP | an online Python web app that accepts academic articles in PDF format and... | 17 | Experimental |
| 50 | Prateek32177/TextlyAI | AI-powered tool to extract and classify text from images using OCR and... | 17 | Experimental |
| 51 | nicdriebe/ocr-ner-sharepic-evaluation | Bachelor's Thesis: Evaluation of open-source OCR and NER pipelines... | 15 | Experimental |
| 52 | iytedbb/OSPA-SuryaOCR | OSPA SuryaOCR – Advanced document processing framework for historical... | 14 | Experimental |
| 53 | Komorebirumu/awe-ms-20260315-2211-01 | AI Historical Document Transcription & Analysis CLI Tool | 14 | Experimental |
| 54 | sushant1827/Mistral-OCR-PDF-Image | This workflow automates the extraction of structured information from PDFs... | 14 | Experimental |
| 55 | jacobmarks/pytesseract-ocr-plugin | Run optical character recognition with PyTesseract from the FiftyOne App! | 13 | Experimental |
| 56 | ITSAIDI/Textra | Welcome to the documentation repository for Textra tool. We combine OCR and... | 12 | Experimental |
| 57 | Duke-Chronicle-Project/awesome-historical-newspaper-analysis | Awesome historical newspaper analysis tools and literature | 12 | Experimental |
| 58 | Methila-Meem/AI_Invoice_Analyzer | Automatically extracts, validates, and structures invoice data from images... | 12 | Experimental |
| 59 | RamezCh/Arabic_Text_Extractor | This project is a fine tuned Tesseract OCR ara on Arabic handwriting with... | 11 | Experimental |
| 60 | jyothish-ram/invoice_ocr_api | Invoice OCR Extraction Flask API | 11 | Experimental |
| 61 | AmmarAhm3d/invoice-gemini-extracter | Invoice-Gemini-Extracter: Python tool to extract structured invoice data... | 10 | Experimental |
| 62 | keeganareeve/kr-ocr-project | Was developed in order to contribute to the COLRC online database. | 10 | Experimental |
| 63 | ohidurbappy/tabular-data-digitalizer | Project to convert handwritten tabular data to excel table | 10 | Experimental |
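
The summary figures at the top of the page (63 tools, one above 70, top score 76) can be re-derived from the score and tier columns of the listing above. The sketch below is only a consistency check: the scores are copied by hand from the listing in rank order, and the variable names are illustrative. It does not assume any tier thresholds beyond the "above 70" rule stated in the summary.

```python
# Scores copied from the listing above, in rank order, grouped by tier label.
scores_by_tier = {
    "Verified": [76],
    "Established": [68, 61, 51],
    "Emerging": [45, 44, 44, 44, 43, 42, 40, 39, 36, 36,
                 36, 35, 35, 35, 34, 33, 32, 32, 31, 31],
    "Experimental": [25, 24, 24, 23, 23, 23, 23, 22, 21, 21,
                     21, 21, 20, 20, 20, 19, 19, 19, 19, 19,
                     18, 18, 18, 18, 17, 17, 15, 14, 14, 14,
                     13, 12, 12, 12, 11, 11, 10, 10, 10],
}

# Number of tools carrying each tier label.
tier_counts = {tier: len(scores) for tier, scores in scores_by_tier.items()}

# Flatten to one list to check the page-level summary figures.
all_scores = [s for scores in scores_by_tier.values() for s in scores]
total_tools = len(all_scores)
above_70 = sum(1 for s in all_scores if s > 70)
top_score = max(all_scores)

print(tier_counts)
print(total_tools, above_70, top_score)
```

Grouping by the tier label as printed, instead of recomputing tiers from guessed cut-offs, keeps the check independent of scoring rules this page does not spell out.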