Document Data Extraction LLM Tools

Tools for extracting, parsing, and converting structured data from unstructured documents (PDFs, images, invoices, etc.) using OCR and LLMs. The category does not include general document summarization, web scraping, or downstream analytics applications.

There are 71 document data extraction tools tracked. 9 score 50 or higher (Established tier). The highest-rated is NanoNets/docstrange at 60/100 with 1,379 stars. 2 of the top 10 are actively maintained.
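The tier labels in the listing appear to follow fixed score cutoffs: every listed tool scoring 50 or more is labelled Established, 30 to 49 is Emerging, and anything below 30 is Experimental. The sketch below encodes that mapping. The cutoffs are inferred from the listed rows rather than stated by the source, and the function name `tier_for` is hypothetical.

```python
def tier_for(score: int) -> str:
    """Map a 0-100 quality score to a tier label.

    Assumption: the cutoffs (50 and 30) are inferred from the listing,
    not taken from any documented scoring rule.
    """
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


# A few (tool, score) pairs copied from the listing, chosen at the tier boundaries.
sample = [
    ("NanoNets/docstrange", 60),
    ("Xyntopia/pydoxtools", 50),
    ("Capevace/data-wizard", 49),
    ("pranavgupta2603/SplitwiseGPTVision", 30),
    ("AFLucas-UOM/Accurate-Name-Extraction", 29),
]

for name, score in sample:
    print(f"{name}: {score} -> {tier_for(score)}")
```

The boundary rows (50, 49, 30, 29) are the ones that pin the cutoffs down. If the real scoring rule differs, only the two constants need to change.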

Get all 71 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=document-data-extraction&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
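The same request can be made from Python with only the standard library. This is a minimal sketch. The endpoint and the three query parameters are copied from the curl command above, but the shape of the JSON response is not documented here, so the code only decodes it and leaves interpretation to the caller. The helper names `build_url` and `fetch_projects` are hypothetical. The sample command uses `limit=20`, and it is only an assumption that the API accepts a larger value (for example 71) to return every project in one call.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example in this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit: int = 20) -> str:
    """Build the dataset URL with the query parameters used in the curl example."""
    params = {
        "domain": "llm-tools",
        "subcategory": "document-data-extraction",
        "limit": limit,  # assumption: values other than 20 are accepted
    }
    return f"{BASE_URL}?{urllib.parse.urlencode(params)}"


def fetch_projects(limit: int = 20, timeout: float = 30.0):
    """Fetch and decode the JSON payload; the response schema is not assumed."""
    with urllib.request.urlopen(build_url(limit), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (performs a network request, so it is left commented out):
# data = fetch_projects()
# print(type(data).__name__)
```

Unauthenticated use is limited to 100 requests per day, so a script that polls this endpoint would do well to cache the result locally. How a free key is attached to a request is not described here, so the sketch does not send one.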

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | NanoNets/docstrange | Extract and convert data from any document, images, pdfs, word doc, ppt or... | 60 | Established |
| 2 | th1nhhdk/local_ai_ocr | An local, offline (after initial setup), portable OCR software that can... | 58 | Established |
| 3 | Dicklesworthstone/llm_aided_ocr | Enhances Tesseract OCR output using LLMs (local or API) for error... | 58 | Established |
| 4 | emcf/thepipe | Get clean data from tricky documents, powered by vision-language models ⚡ | 56 | Established |
| 5 | langstruct-ai/langstruct | Extract structured data from any content using LLMs. | 55 | Established |
| 6 | hashangit/Extract2MD | Extract2MD is a powerful and versatile AI-enabled client-side JavaScript... | 52 | Established |
| 7 | CatchTheTornado/text-extract-api | Document (PDF, Word, PPTX ...) extraction and parse API using state of the... | 51 | Established |
| 8 | CambioML/uniflow | LLM-based text extraction from unstructured data like PDFs, Words and HTMLs.... | 50 | Established |
| 9 | Xyntopia/pydoxtools | Effortlessly extract information from unstructured data with this library,... | 50 | Established |
| 10 | Capevace/data-wizard | Extract structured data from PDFs, Word docs and images. Embeddable directly... | 49 | Emerging |
| 11 | enoch3712/ExtractThinker | ExtractThinker is a Document Intelligence library for LLMs, offering... | 48 | Emerging |
| 12 | langchain-ai/langchain-extract | 🦜⛏️ Did you say you like data? | 45 | Emerging |
| 13 | QuivrHQ/MegaParse | File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx,... | 44 | Emerging |
| 14 | arshad-yaseen/ocr-llm | ⚡️ Fast, ultra-accurate text extraction from any image or PDF—including... | 43 | Emerging |
| 15 | junhoyeo/BetterOCR | 🔍 Better text detection by combining multiple OCR engines (EasyOCR,... | 42 | Emerging |
| 16 | LM-150A/docflash | ⚡ AI-powered content intelligence with structured data extraction.... | 41 | Emerging |
| 17 | Traves-Theberge/webform-cli | A CLI tool for extracting unstructured data from websites using customizable... | 39 | Emerging |
| 18 | heripo-lab/heripo-engine | TypeScript library for extracting structured data from archaeological... | 39 | Emerging |
| 19 | Lazzzer/structurizer | Structurizer is a web application that helps you extract structured data... | 38 | Emerging |
| 20 | ShengjieJin/pdftrim-for-llm | An open-source Zotero plugin for vibe reading and LLM-assisted paper... | 37 | Emerging |
| 21 | kennethleungty/LangExtract-Gemma-Structured-Extraction | Using LangExtract and Gemma 3 for structured information extraction from... | 37 | Emerging |
| 22 | lias-laboratory/cidoccrm-llm-extractor | A tool for automating CIDOC CRM knowledge graph population using Large... | 34 | Emerging |
| 23 | mohanbing/st_doc_ext | This repository contains the code for the information extraction app that... | 34 | Emerging |
| 24 | phanxuanquang/XCan-AI | Extract the text, style, format, and layout from any images, even... | 34 | Emerging |
| 25 | messeb/py-openai-receipt-extractor | Extracts structured data from receipts via OpenAI API | 33 | Emerging |
| 26 | jamesmcroft/azure-document-intelligence-markdown-to-openai-data-extraction-sample | This sample demonstrates how to use Document Intelligence's Layout model to... | 33 | Emerging |
| 27 | jamesmcroft/ai-document-data-extraction-evaluation | This project demonstrates how to evaluate the use of LLMs and SLMs for... | 33 | Emerging |
| 28 | isaiah76/Reviewer | extracts text from pdfs and powerpoint documents and summarizes it into key... | 32 | Emerging |
| 29 | sabber-slt/NetExtract | NetExtract: Efficiently extract core content from any webpage and convert it... | 32 | Emerging |
| 30 | CambioML/any-parser | Accurate, private and configurable document retrieval LLM | 31 | Emerging |
| 31 | pranavgupta2603/SplitwiseGPTVision | SplitwiseGPT Vision: Streamline bill splitting with AI-driven image... | 30 | Emerging |
| 32 | AFLucas-UOM/Accurate-Name-Extraction | 2026 IEEE Conference on Artificial Intelligence (CAI26) · A modular computer... | 29 | Experimental |
| 33 | mike-grant/intelliextract | Extract structured data from your unstructured data | 26 | Experimental |
| 34 | Ja-yy/Invoice-extractor | Streamlit app leveraging OpenAI's LLM for accurate invoice extraction,... | 24 | Experimental |
| 35 | wmahfoudh/crabocr | PDF and image to-text converter with XFA forms support. It extract embedded... | 24 | Experimental |
| 36 | QuartzUnit/docpick | Lightweight OCR + Local LLM → Schema-based Structured JSON Extraction | 22 | Experimental |
| 37 | Randika00/VisionGPT-Extractor | An AI-powered tool designed to extract structured data from documents,... | 22 | Experimental |
| 38 | lecuong1502/NanoOCR | NanoOCR — Internal document OCR system powered by GLM-OCR, with a FastAPI... | 22 | Experimental |
| 39 | lisstasy/Receipt_Scanner | Advanced receipt OCR and analysis using PaddleOCR, GPT-3.5-turbo, Plotly,... | 22 | Experimental |
| 40 | Nguyendu9096/langcore-api | Provide production-ready HTTP API for structured document extraction using... | 22 | Experimental |
| 41 | jaimvizalla01/aiwhisperer | 📄 Optimize your large documents for AI analysis by converting and splitting... | 21 | Experimental |
| 42 | SH-Nihil-Mukkesh-25/fractaAI | FractaAI is a Streamlit-based application for exploring and visualizing text... | 21 | Experimental |
| 43 | isobarbaric/SnapTrack | a receipt CLI | 21 | Experimental |
| 44 | Tek233/Document-Processing-with-OCR | An agent for document processing using OCR | 21 | Experimental |
| 45 | voidpenguin-28/Textractor-ExtraExtensions | Several useful Textractor extensions, which are not available by default in... | 21 | Experimental |
| 46 | Jishnnu/InvoiceAI-Document-Parser | Simple Streamlit application that parses the data from Invoice images and... | 20 | Experimental |
| 47 | amit-timalsina/document_classification | All in one package for Document (image, pdf) Classification. Unified... | 20 | Experimental |
| 48 | ilyassuelen/InsightAI | InsightAI: Python-based document processing platform with chunking,... | 19 | Experimental |
| 49 | RPramodh/LLM-based-Invoice-Extractor | This repository hosts the source code for an Invoice Extractor application... | 19 | Experimental |
| 50 | mu373/vertex-ai-ocr | Convert scanned book images to Markdown with Gemini | 18 | Experimental |
| 51 | PMTheTechGuy/document-entity-extractor | AI-powered document extractor for names, emails, and organizations. License: MIT | 18 | Experimental |
| 52 | xiangjianxiaohuangyu/paper-extract-app | AI-powered desktop tool for extracting structured information from academic... | 16 | Experimental |
| 53 | agxp/docpulse | Async document intelligence API — upload any PDF/DOCX/image + a JSON Schema,... | 15 | Experimental |
| 54 | leadershop/marksheet-information-extraction-api | 🎓 Extract and validate data from academic marksheets using AI for accurate... | 14 | Experimental |
| 55 | cucumberian/__ai_draft-parser | structured data extraction from drafts | 14 | Experimental |
| 56 | andyed/fascist-language-analyzer | langchain+langextract gemini-api breakdown of Project2025 text by Umberto... | 14 | Experimental |
| 57 | abdulmanafsahito/Vision-OCR | A general OCR and image-understanding web app. Upload an image, write a... | 13 | Experimental |
| 58 | haritha8503/langextract | 🌐 Extract languages from text seamlessly using LangExtract. Simplify... | 13 | Experimental |
| 59 | junotb/omniparse-ai-stack | Document & image parsing full-stack demo. OCR, VLM, document layout... | 13 | Experimental |
| 60 | HTLinh0604/invoice_ai_automation | This project transforms messy invoice images into a structured, searchable... | 13 | Experimental |
| 61 | r0b0tan/document-ai-demo | Full-stack demo for AI-assisted document analysis. Upload a document and let... | 13 | Experimental |
| 62 | tahangz/Multimodal_OCR_LLM | This project is a user-friendly web application that allows you to upload... | 13 | Experimental |
| 63 | Tahsine/warmup-ernie-paddleocr | Demonstration of the AI pipeline for the conversion of structured documents:... | 12 | Experimental |
| 64 | giruu/TesserXtract.AI | This Flask application empowers users to seamlessly upload image files like... | 12 | Experimental |
| 65 | rririanto/unstructured-demo-streamlit | Extract your docs (CSV, PDF, JSON, HTML, DOCS, Sheets and more) for your own... | 12 | Experimental |
| 66 | suddeb/langextract | My experiment with langextract - a Python library for extracting structured... | 11 | Experimental |
| 67 | thinktecture-labs/llm-extract-structured-information-langchain-kor | Very simple sample that extracts JSON based on a schema from human text input. | 11 | Experimental |
| 68 | joery0x3b800001/Intelligent-Document-Verification-System | The application uses state-of-the-art NLP models for summarization and... | 11 | Experimental |
| 69 | Bang-tv259/LLM_Ngrok_Flask | OCR (Optical Character Recognition) on images and expose the functionality... | 10 | Experimental |
| 70 | ngtrdai/extractor | Extractor is a powerful tool that leverages the capabilities of Langchain to... | 10 | Experimental |
| 71 | fri3erg/DataDig-AIExtractor | App used to extract structured data from documents photos or pdfs via custom... | 10 | Experimental |