jennis0/burdoc
Advanced PDF parsing for Python
This tool helps you accurately extract structured text, images, and tables from PDF documents, preserving their original reading order and context. It takes a PDF file as input and provides a comprehensive JSON output that includes headings, paragraphs, lists, tables, and images, along with metadata like fonts and bounding boxes. This is ideal for data analysts, researchers, or anyone who needs to process large volumes of information locked in PDFs for further analysis or database ingestion.
No commits in the last 6 months. Available on PyPI.
Use this if you need to reliably pull out specific data points, complex tables, or the full semantic structure from a PDF for data processing and analysis.
Not ideal if your PDFs are scanned images (OCR-dependent), contain right-to-left text, have highly complex graphical layouts, or include interactive forms.
Stars: 12
Forks: 3
Language: HTML
License: MIT
Category:
Last pushed: Jan 21, 2025
Commits (30d): 0
Dependencies: 11
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/jennis0/burdoc"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
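For scripted use, the same request can be made from Python. The following is a minimal sketch using only the standard library. It assumes the endpoint returns JSON and that the URL path follows a category/repository pattern, neither of which is stated on this page; the helper names build_quality_url and fetch_quality are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; everything after it is treated
# as "<category>/<owner>/<repo>", which is an assumption about the URL scheme.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL for a repository (hypothetical helper)."""
    # Keep the slash in "owner/name" intact and percent-escape anything else.
    return f"{BASE_URL}/{quote(category, safe='')}/{quote(repo, safe='/')}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0):
    """Fetch the record for a repository.

    Assumes the response body is JSON; json.loads raises ValueError
    (json.JSONDecodeError) if the service returns something else.
    """
    url = build_quality_url(category, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (performs a network request, counted against the daily limit):
# record = fetch_quality("ml-frameworks", "jennis0/burdoc")
# print(record)
```

No API key is sent here, so calls fall under the keyless daily limit described above.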
Higher-rated alternatives
paperless-ngx/paperless-ngx
A community-supported supercharged document management system: scan, index and archive all your documents
GoogleCloudPlatform/document-ai-samples
Sample applications and demos for Document AI, the end-to-end document processing platform on...
aws-solutions/document-understanding-solution
Example of integrating & using Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical,...
naiveHobo/InvoiceNet
Deep neural network to extract intelligent information from invoice documents.
aphp/edspdf
EDS-PDF is a generic, pure-Python framework for text extraction from PDF documents. It provides...