PaddlePaddle/PaddleOCR
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.
This tool helps you convert any image or PDF document into structured data like Markdown or JSON. It accurately extracts text and layout information from even challenging documents, making it ready for use in advanced AI applications. Marketing analysts, operations managers, and data entry specialists can use this to automate data extraction from various documents.
72,167 stars. Used by 10 other packages. Actively maintained with 12 commits in the last 30 days. Available on PyPI.
Use this if you need to reliably extract text and structural information from documents, especially those that are scanned, warped, or photographed, and want to use that data for AI applications.
Not ideal if you only need basic text copying and pasting, or if your documents are already in a perfectly editable digital format.
Stars: 72,167
Forks: 9,954
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 12, 2026
Commits (30d): 12
Dependencies: 4
Reverse dependents: 10
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/PaddlePaddle/PaddleOCR"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
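The same request can be made from a script. The following is a minimal Python sketch using only the standard library. It assumes the endpoint returns a JSON body, which this page does not confirm, and the helper names `build_url` and `fetch_quality` are hypothetical, chosen for illustration only.

```python
import json
import urllib.request

# Base path copied from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/rag"


def build_url(repo: str) -> str:
    """Return the endpoint URL for an "owner/name" repository slug."""
    owner, _, name = repo.partition("/")
    if not owner or not name:
        raise ValueError(f"expected 'owner/name', got {repo!r}")
    return f"{BASE_URL}/{owner}/{name}"


def fetch_quality(repo: str, timeout: float = 10.0):
    """Fetch the endpoint for `repo` and parse the body.

    Assumption: the response is JSON. The page does not document the
    response schema, so no specific fields are accessed here.
    """
    with urllib.request.urlopen(build_url(repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("PaddlePaddle/PaddleOCR")` issues the same GET request as the curl command above. Without a key, requests count against the 100 per day limit, so a script that polls many repositories may want to cache results.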
Related tools
kreuzberg-dev/kreuzberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and...
yfedoseev/pdf_oxide
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown...
opendataloader-project/opendataloader-pdf
PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
AKSarav/pdfstract
PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline - Available as CLI - WEBUI - API
NanoNets/docext
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking...