Document Data Extraction LLM Tools
Tools for extracting, parsing, and converting structured data from unstructured documents (PDFs, images, invoices, etc.) using OCR and LLMs. Does NOT include general document summarization, web scraping, or downstream analytics applications.
Of the 71 document data extraction tools tracked, 9 score above 50 (the Established tier). The highest-rated is NanoNets/docstrange at 60/100 with 1,379 stars. Of the top 10, 2 are actively maintained.
Get all 71 projects as JSON (the sample request below uses limit=20)
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=document-data-extraction&limit=20"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
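The same request can be made from a script. The following is a minimal sketch using only the Python standard library; it rebuilds the URL from the curl example above and parses the response as JSON. The helper names `build_url` and `fetch_dataset` are hypothetical, and the shape of the returned JSON is not documented here, so the code returns it unmodified. Whether a larger `limit` value returns all 71 projects in one call is an assumption that should be checked against the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the dataset URL with the query parameters from the curl example."""
    params = {
        "domain": "llm-tools",
        "subcategory": "document-data-extraction",
        "limit": limit,  # assumption: a higher limit returns more projects
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_dataset(limit=20, timeout=30):
    """Fetch the dataset and return the parsed JSON as-is.

    The response structure is not described on this page, so no
    particular keys are assumed.
    """
    with urlopen(build_url(limit), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    data = fetch_dataset()
    # Print only the start of the payload to inspect its structure.
    print(json.dumps(data, indent=2)[:500])
```

No key is sent in this sketch, so it falls under the keyless limit of 100 requests/day; caching the response locally avoids spending that quota on repeated runs.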
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | NanoNets/docstrange: Extract and convert data from any document, images, pdfs, word doc, ppt or... | | Established |
| 2 | th1nhhdk/local_ai_ocr: An local, offline (after initial setup), portable OCR software that can... | | Established |
| 3 | Dicklesworthstone/llm_aided_ocr: Enhances Tesseract OCR output using LLMs (local or API) for error... | | Established |
| 4 | emcf/thepipe: Get clean data from tricky documents, powered by vision-language models ⚡ | | Established |
| 5 | langstruct-ai/langstruct: Extract structured data from any content using LLMs. | | Established |
| 6 | hashangit/Extract2MD: Extract2MD is a powerful and versatile AI-enabled client-side JavaScript... | | Established |
| 7 | CatchTheTornado/text-extract-api: Document (PDF, Word, PPTX ...) extraction and parse API using state of the... | | Established |
| 8 | CambioML/uniflow: LLM-based text extraction from unstructured data like PDFs, Words and HTMLs.... | | Established |
| 9 | Xyntopia/pydoxtools: Effortlessly extract information from unstructured data with this library,... | | Established |
| 10 | Capevace/data-wizard: Extract structured data from PDFs, Word docs and images. Embeddable directly... | | Emerging |
| 11 | enoch3712/ExtractThinker: ExtractThinker is a Document Intelligence library for LLMs, offering... | | Emerging |
| 12 | langchain-ai/langchain-extract: 🦜⛏️ Did you say you like data? | | Emerging |
| 13 | QuivrHQ/MegaParse: File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx,... | | Emerging |
| 14 | arshad-yaseen/ocr-llm: ⚡️ Fast, ultra-accurate text extraction from any image or PDF—including... | | Emerging |
| 15 | junhoyeo/BetterOCR: 🔍 Better text detection by combining multiple OCR engines (EasyOCR,... | | Emerging |
| 16 | LM-150A/docflash: ⚡ AI-powered content intelligence with structured data extraction.... | | Emerging |
| 17 | Traves-Theberge/webform-cli: A CLI tool for extracting unstructured data from websites using customizable... | | Emerging |
| 18 | heripo-lab/heripo-engine: TypeScript library for extracting structured data from archaeological... | | Emerging |
| 19 | Lazzzer/structurizer: Structurizer is a web application that helps you extract structured data... | | Emerging |
| 20 | ShengjieJin/pdftrim-for-llm: An open-source Zotero plugin for vibe reading and LLM-assisted paper... | | Emerging |
| 21 | kennethleungty/LangExtract-Gemma-Structured-Extraction: Using LangExtract and Gemma 3 for structured information extraction from... | | Emerging |
| 22 | lias-laboratory/cidoccrm-llm-extractor: A tool for automating CIDOC CRM knowledge graph population using Large... | | Emerging |
| 23 | mohanbing/st_doc_ext: This repository contains the code for the information extraction app that... | | Emerging |
| 24 | phanxuanquang/XCan-AI: Extract the text, style, format, and layout from any images, even... | | Emerging |
| 25 | messeb/py-openai-receipt-extractor: Extracts structured data from receipts via OpenAI API | | Emerging |
| 26 | jamesmcroft/azure-document-intelligence-markdown-to-openai-data-extraction-sample: This sample demonstrates how to use Document Intelligence's Layout model to... | | Emerging |
| 27 | jamesmcroft/ai-document-data-extraction-evaluation: This project demonstrates how to evaluate the use of LLMs and SLMs for... | | Emerging |
| 28 | isaiah76/Reviewer: extracts text from pdfs and powerpoint documents and summarizes it into key... | | Emerging |
| 29 | sabber-slt/NetExtract: NetExtract: Efficiently extract core content from any webpage and convert it... | | Emerging |
| 30 | CambioML/any-parser: Accurate, private and configurable document retrieval LLM | | Emerging |
| 31 | pranavgupta2603/SplitwiseGPTVision: SplitwiseGPT Vision: Streamline bill splitting with AI-driven image... | | Emerging |
| 32 | AFLucas-UOM/Accurate-Name-Extraction: 2026 IEEE Conference on Artificial Intelligence (CAI26) · A modular computer... | | Experimental |
| 33 | mike-grant/intelliextract: Extract structured data from your unstructured data | | Experimental |
| 34 | Ja-yy/Invoice-extractor: Streamlit app leveraging OpenAI's LLM for accurate invoice extraction,... | | Experimental |
| 35 | wmahfoudh/crabocr: PDF and image to-text converter with XFA forms support. It extract embedded... | | Experimental |
| 36 | QuartzUnit/docpick: Lightweight OCR + Local LLM → Schema-based Structured JSON Extraction | | Experimental |
| 37 | Randika00/VisionGPT-Extractor: An AI-powered tool designed to extract structured data from documents,... | | Experimental |
| 38 | lecuong1502/NanoOCR: NanoOCR — Internal document OCR system powered by GLM-OCR, with a FastAPI... | | Experimental |
| 39 | lisstasy/Receipt_Scanner: Advanced receipt OCR and analysis using PaddleOCR, GPT-3.5-turbo, Plotly,... | | Experimental |
| 40 | Nguyendu9096/langcore-api: Provide production-ready HTTP API for structured document extraction using... | | Experimental |
| 41 | jaimvizalla01/aiwhisperer: 📄 Optimize your large documents for AI analysis by converting and splitting... | | Experimental |
| 42 | SH-Nihil-Mukkesh-25/fractaAI: FractaAI is a Streamlit-based application for exploring and visualizing text... | | Experimental |
| 43 | isobarbaric/SnapTrack: a receipt CLI | | Experimental |
| 44 | Tek233/Document-Processing-with-OCR: An agent for document processing using OCR | | Experimental |
| 45 | voidpenguin-28/Textractor-ExtraExtensions: Several useful Textractor extensions, which are not available by default in... | | Experimental |
| 46 | Jishnnu/InvoiceAI-Document-Parser: Simple Streamlit application that parses the data from Invoice images and... | | Experimental |
| 47 | amit-timalsina/document_classification: All in one package for Document (image, pdf) Classification. Unified... | | Experimental |
| 48 | ilyassuelen/InsightAI: InsightAI: Python-based document processing platform with chunking,... | | Experimental |
| 49 | RPramodh/LLM-based-Invoice-Extractor: This repository hosts the source code for an Invoice Extractor application... | | Experimental |
| 50 | mu373/vertex-ai-ocr: Convert scanned book images to Markdown with Gemini | | Experimental |
| 51 | PMTheTechGuy/document-entity-extractor: AI-powered document extractor for names, emails, and organizations. License: MIT | | Experimental |
| 52 | xiangjianxiaohuangyu/paper-extract-app: AI-powered desktop tool for extracting structured information from academic... | | Experimental |
| 53 | agxp/docpulse: Async document intelligence API — upload any PDF/DOCX/image + a JSON Schema,... | | Experimental |
| 54 | leadershop/marksheet-information-extraction-api: 🎓 Extract and validate data from academic marksheets using AI for accurate... | | Experimental |
| 55 | cucumberian/__ai_draft-parser: structured data extraction from drafts | | Experimental |
| 56 | andyed/fascist-language-analyzer: langchain+langextract gemini-api breakdown of Project2025 text by Umberto... | | Experimental |
| 57 | abdulmanafsahito/Vision-OCR: A general OCR and image-understanding web app. Upload an image, write a... | | Experimental |
| 58 | haritha8503/langextract: 🌐 Extract languages from text seamlessly using LangExtract. Simplify... | | Experimental |
| 59 | junotb/omniparse-ai-stack: Document & image parsing full-stack demo. OCR, VLM, document layout... | | Experimental |
| 60 | HTLinh0604/invoice_ai_automation: This project transforms messy invoice images into a structured, searchable... | | Experimental |
| 61 | r0b0tan/document-ai-demo: Full-stack demo for AI-assisted document analysis. Upload a document and let... | | Experimental |
| 62 | tahangz/Multimodal_OCR_LLM: This project is a user-friendly web application that allows you to upload... | | Experimental |
| 63 | Tahsine/warmup-ernie-paddleocr: Demonstration of the AI pipeline for the conversion of structured documents:... | | Experimental |
| 64 | giruu/TesserXtract.AI: This Flask application empowers users to seamlessly upload image files like... | | Experimental |
| 65 | rririanto/unstructured-demo-streamlit: Extract your docs (CSV, PDF, JSON, HTML, DOCS, Sheets and more) for your own... | | Experimental |
| 66 | suddeb/langextract: My experiment with langextract - a Python library for extracting structured... | | Experimental |
| 67 | thinktecture-labs/llm-extract-structured-information-langchain-kor: Very simple sample that extracts JSON based on a schema from human text input. | | Experimental |
| 68 | joery0x3b800001/Intelligent-Document-Verification-System: The application uses state-of-the-art NLP models for summarization and... | | Experimental |
| 69 | Bang-tv259/LLM_Ngrok_Flask: OCR (Optical Character Recognition) on images and expose the functionality... | | Experimental |
| 70 | ngtrdai/extractor: Extractor is a powerful tool that leverages the capabilities of Langchain to... | | Experimental |
| 71 | fri3erg/DataDig-AIExtractor: App used to extract structured data from documents photos or pdfs via custom... | | Experimental |