jennis0/burdoc

Advanced PDF parsing for Python

44 / 100, Emerging

This tool helps you accurately extract structured text, images, and tables from PDF documents, preserving their original reading order and context. It takes a PDF file as input and provides a comprehensive JSON output that includes headings, paragraphs, lists, tables, and images, along with metadata like fonts and bounding boxes. This is ideal for data analysts, researchers, or anyone who needs to process large volumes of information locked in PDFs for further analysis or database ingestion.

No commits in the last 6 months. Available on PyPI.

Use this if you need to reliably pull out specific data points, complex tables, or the full semantic structure from a PDF for data processing and analysis.

Not ideal if your PDFs are scanned images (OCR-dependent), contain right-to-left text, have highly complex graphical layouts, or include interactive forms.

data-extraction document-analysis information-retrieval content-management data-wrangling
Stale 6m
Maintenance 0 / 25
Adoption 5 / 25
Maturity 25 / 25
Community 14 / 25
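
The four category scores above add up to the overall score shown at the top of the page. The sketch below checks that arithmetic. It assumes the overall score is an unweighted sum of the four categories; the page does not state the formula, and the variable names are illustrative only.

```python
# Category scores as listed above, each out of 25.
category_scores = {
    "Maintenance": 0,
    "Adoption": 5,
    "Maturity": 25,
    "Community": 14,
}

# Assumption: the overall score is the unweighted sum of the categories.
total = sum(category_scores.values())
max_total = 25 * len(category_scores)

print(f"{total} / {max_total}")  # prints "44 / 100"
```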


Stars: 12
Forks: 3
Language: HTML
License: MIT
Last pushed: Jan 21, 2025
Commits (30d): 0
Dependencies: 11

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/jennis0/burdoc"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
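
The same request can be made from Python with only the standard library. This is a minimal sketch, not an official client. The URL is taken from the curl example above; the `category/owner/repo` path pattern, the helper names `quality_url` and `fetch_quality`, and the assumption that the response body is JSON are not confirmed by this page.

```python
import json
import urllib.request

# Base path copied from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (hypothetical helper; path pattern is assumed)."""
    return f"{API_BASE}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """GET the endpoint and decode the body, assuming it is JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("ml-frameworks", "jennis0", "burdoc")` performs a real network request and counts against the daily request limit.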