Document OCR Extraction NLP Tools
Tools for extracting structured and unstructured text from documents (PDFs, scans, receipts, invoices, IDs) using OCR and computer vision. Does NOT include general document analysis, summarization, or retrieval systems without extraction focus.
There are 63 document OCR extraction tools tracked. One scores above 70 (Verified tier). The highest-rated is deepdoctection/deepdoctection at 76/100 with 3,147 stars. Two of the top 10 are actively maintained.
Get all 63 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=document-ocr-extraction&limit=20"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
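The same request can be issued from a script. The following is a minimal sketch in Python that rebuilds the URL shown in the curl command from its query parameters and fetches it with the standard library. The helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the JSON response is not documented here, so the sketch returns the parsed payload unchanged. Whether the endpoint accepts a `limit` above 20 is also an assumption that would need checking against the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="document-ocr-extraction", limit=20):
    """Build the request URL (hypothetical helper, not part of any official client)."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and parse the body as JSON.

    The response schema is an assumption left unexamined here:
    the parsed payload is returned as-is.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Printing the URL only; call fetch_projects(build_url()) to hit the network.
    print(build_url())
```

Keeping URL construction separate from the network call makes it easy to change `limit` or `subcategory` without touching the fetch logic, and unauthenticated use stays within the 100 requests/day limit as long as the script is not run in a tight loop.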
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | deepdoctection/deepdoctection<br>A Repo For Document AI | | Verified |
| 2 | deanmalmgren/textract<br>extract text from any document. no muss. no fuss. | | Established |
| 3 | eikek/docspell<br>Assist in organizing your piles of documents, resulting from scanners,... | | Established |
| 4 | zzzDavid/ICDAR-2019-SROIE<br>ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information... | | Established |
| 5 | clovaai/donut<br>Official Implementation of OCR-free Document Understanding Transformer... | | Emerging |
| 6 | axa-group/Parsr<br>Transforms PDF, Documents and Images into Enriched Structured Data | | Emerging |
| 7 | Saransh-cpp/OCRed<br>Clever, simple, and intuitive wrapper functionalities for OCRing specific... | | Emerging |
| 8 | gnana70/tamil_ocr<br>OCR Tamil is a powerful tool that can detect and recognize text in Tamil... | | Emerging |
| 9 | JonnoB/reading_the_unreadable<br>A pipeline for performing OCR on historical newspapers | | Emerging |
| 10 | rithulkamesh/docproc<br>Document Intelligence Platform: Extract, refine, and query documents with... | | Emerging |
| 11 | NjoyimPeguy/ICDAR-2019-RRC-SROIE<br>ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information... | | Emerging |
| 12 | s3nh/text-detector<br>Tool which allow you to detect and translate text. | | Emerging |
| 13 | gani114433/OCR_workflow<br>N8N OCR workflow | | Emerging |
| 14 | Rushi-Balapure/pdf_2_json_extractor<br>A high-performance Python library for extracting structured content from PDF... | | Emerging |
| 15 | clovaai/webvicob<br>Official Implementation of Web-based Visual Corpus Builder (Webvicob), ICDAR 2023 | | Emerging |
| 16 | situx/CuneiPainter<br>An App to recognize cuneiform characters on your Android phone | | Emerging |
| 17 | lukevanin/OCRAI<br>Optical Character Recognition Artificial Intelligence iOS app for Udacity nanodegree | | Emerging |
| 18 | louisbrulenaudet/apple-ocr<br>Easy-to-Use Apple Vision wrapper for text extraction, scalar representation... | | Emerging |
| 19 | trhgquan/OCR_chu_nom<br>Chữ Nôm OCR course project (CSC15006) | | Emerging |
| 20 | Samuel310/Text-Recognition<br>Android application to extract text from an image using firebase MLkit. | | Emerging |
| 21 | codebywiam/invoice-ocr<br>This project extracts key fields (like invoice number, date, total, and... | | Emerging |
| 22 | Shulk97/daniel<br>This repository contain the implementation of DANIEL. (A fast Document... | | Emerging |
| 23 | ierolsen/Business-Card-Reader-App<br>The main idea of this project is that extracting entities from the scanned... | | Emerging |
| 24 | jweissenberger/auto-docs<br>A CLI tool that automatically generates documentation for python code using... | | Emerging |
| 25 | macosnik/Recognize-text-from-image<br>Telegram bot for recognizing text in images using neural networks | | Experimental |
| 26 | isikmuhamm/unstructured-data-extraction-engine<br>Automated data ingestion pipeline for extracting plain text from proprietary... | | Experimental |
| 27 | transybao1393/android-ocr<br>Android OCR using CameraX, support MLKit, support offline mode, support... | | Experimental |
| 28 | meck93/ScanOrUploadMe<br>A React-Native mobile application that digitalizes physical event... | | Experimental |
| 29 | erl-ang/interactive-ocr<br>Implementation of a couple of heuristics that estimate OCR quality without... | | Experimental |
| 30 | itshivams/Persona-Driven-Document-Intelligence<br>Persona-Driven Document Intelligence – A lightweight, CPU-only system that... | | Experimental |
| 31 | fmadore/iwac-ai-pipelines<br>AI pipelines for Omeka S digital collections - OCR correction, entity... | | Experimental |
| 32 | xuan3986/Texthandle<br>Open source project provided to Baidu PaddlePaddle community. Apply... | | Experimental |
| 33 | archity/doc-scanner<br>Computer Vision and NLP based document scanner, text extractor and summarizer. | | Experimental |
| 34 | SivaPA08/text-capture<br>Captures screen regions, extracts text and copies it to the clipboard | | Experimental |
| 35 | dev-sungman/recent-ocr-papers<br>this repo include paper review, code in text detection, text recognition,... | | Experimental |
| 36 | esteininger/file-processor<br>A Python library that uses AI to convert unstructured files (like PDFs,... | | Experimental |
| 37 | avirajsa/DocuMind<br>DocuMind - Python project for document analysis. Analyze, summarize, and... | | Experimental |
| 38 | Keizouw8/OCR-Command-Line-Tool<br>A tool that can be used in the CLI or NodeJS environment to scan for text in... | | Experimental |
| 39 | shubh11220/PDF-Text-Extraction<br>Create a data extraction platform for users to conveniently obtain data in a... | | Experimental |
| 40 | DecisionNerd/docunderstand<br>A python system for Visually Rich Document Understanding | | Experimental |
| 41 | mishaelaaa/OCR<br>This is a project in which I store all my attempts to create an application... | | Experimental |
| 42 | Cool-fire/Snipps<br>📚 📝📜 A simple android app to convert information into digital snippets,... | | Experimental |
| 43 | SundayOni/document-ocr-nlp-pipeline<br>End-to-end pipeline for extracting and structuring text from scanned, PDF... | | Experimental |
| 44 | michael-borck/document-lens<br>Analyzes text documents for readability, academic integrity, and linguistic... | | Experimental |
| 45 | husnutass/ml_kit_app<br>A Flutter mobile app to read data from business cards and save that data in... | | Experimental |
| 46 | saloni-rangari/nlp-ocr-marathi<br>This mini-project implements Marathi handwritten text recognition using... | | Experimental |
| 47 | emilyhasson/Text-Recognition<br>Scripts to convert low-quality scanned PDFs to text files using Google Cloud... | | Experimental |
| 48 | HySonLab/TeBaAb<br>TeBaAb: Text-Based Antigen-Conditioned Antibody Redesign via Directed Evolution | | Experimental |
| 49 | fdovila/PDF2TXT4NLP<br>an online Python web app that accepts academic articles in PDF format and... | | Experimental |
| 50 | Prateek32177/TextlyAI<br>AI-powered tool to extract and classify text from images using OCR and... | | Experimental |
| 51 | nicdriebe/ocr-ner-sharepic-evaluation<br>Bachelor's Thesis: Evaluation of open-source OCR and NER pipelines... | | Experimental |
| 52 | iytedbb/OSPA-SuryaOCR<br>OSPA SuryaOCR – Advanced document processing framework for historical... | | Experimental |
| 53 | Komorebirumu/awe-ms-20260315-2211-01<br>AI Historical Document Transcription & Analysis CLI Tool | | Experimental |
| 54 | sushant1827/Mistral-OCR-PDF-Image<br>This workflow automates the extraction of structured information from PDFs... | | Experimental |
| 55 | jacobmarks/pytesseract-ocr-plugin<br>Run optical character recognition with PyTesseract from the FiftyOne App! | | Experimental |
| 56 | ITSAIDI/Textra<br>Welcome to the documentation repository for Textra tool. We combine OCR and... | | Experimental |
| 57 | Duke-Chronicle-Project/awesome-historical-newspaper-analysis<br>Awesome historical newspaper analysis tools and literature | | Experimental |
| 58 | Methila-Meem/AI_Invoice_Analyzer<br>Automatically extracts, validates, and structures invoice data from images... | | Experimental |
| 59 | RamezCh/Arabic_Text_Extractor<br>This project is a fine tuned Tesseract OCR ara on Arabic handwriting with... | | Experimental |
| 60 | jyothish-ram/invoice_ocr_api<br>Invoice OCR Extraction Flask API | | Experimental |
| 61 | AmmarAhm3d/invoice-gemini-extracter<br>Invoice-Gemini-Extracter: Python tool to extract structured invoice data... | | Experimental |
| 62 | keeganareeve/kr-ocr-project<br>Was developed in order to contribute to the COLRC online database. | | Experimental |
| 63 | ohidurbappy/tabular-data-digitalizer<br>Project to convert handwritten tabular data to excel table | | Experimental |