Huang-lab/figure-extractor
A Flask-based service that uses PDFFigures 2.0 to extract figures and tables from scholarly PDFs. It provides a REST API, a CLI, Docker support, and JSON metadata output, and processes roughly 1.5 s per page. Designed for document processing and RAG pipelines.
This tool helps researchers, data scientists, or content managers automatically pull out figures, tables, and their captions from scholarly PDF documents. You feed it research papers in PDF format, and it outputs the extracted images and structured metadata (like captions and coordinates) for each figure and table in JSON format. It's designed for anyone working with large collections of academic papers who needs to analyze or reuse their visual content.
Use this if you need to programmatically extract visual content like graphs, charts, and data tables from scientific or academic PDFs for further analysis or integration into other systems.
Not ideal if you only need to view PDFs or manually extract a few figures, as this tool is designed for automated, high-volume processing.
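As a rough illustration of working with the structured metadata described above, the sketch below summarizes a list of figure and table records. It assumes the JSON resembles the PDFFigures 2.0 output schema (fields such as `figType`, `caption`, `page`, and `regionBoundary`); the exact shape returned by this service is not documented here, and the sample records are made up for the example.

```python
import json
from collections import Counter

# Hypothetical sample records in the style of PDFFigures 2.0 output.
# The field names are an assumption about this service's JSON; the values are invented.
SAMPLE_JSON = """
[
  {"name": "1", "figType": "Figure", "page": 2,
   "caption": "Figure 1: System overview.",
   "regionBoundary": {"x1": 72.0, "y1": 100.0, "x2": 300.0, "y2": 250.0}},
  {"name": "1", "figType": "Table", "page": 4,
   "caption": "Table 1: Dataset statistics.",
   "regionBoundary": {"x1": 72.0, "y1": 80.0, "x2": 540.0, "y2": 180.0}},
  {"name": "2", "figType": "Figure", "page": 5,
   "caption": "Figure 2: Results by model.",
   "regionBoundary": {"x1": 90.0, "y1": 300.0, "x2": 500.0, "y2": 600.0}}
]
"""


def region_area(record):
    """Return the area of a record's bounding box in the PDF's coordinate units."""
    box = record["regionBoundary"]
    return (box["x2"] - box["x1"]) * (box["y2"] - box["y1"])


def summarize(records):
    """Count records per type and collect (page, caption) pairs sorted by page."""
    counts = Counter(record["figType"] for record in records)
    captions = sorted((record["page"], record["caption"]) for record in records)
    return counts, captions


records = json.loads(SAMPLE_JSON)
counts, captions = summarize(records)

print(dict(counts))
for page, caption in captions:
    print(f"p.{page}: {caption}")
```

Keeping the bounding box alongside the caption makes it possible to crop or re-render a region later without re-running extraction.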
Stars: 15
Forks: 2
Language: Python
License: not listed
Category:
Last pushed: Dec 29, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/Huang-lab/figure-extractor"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
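The same request as the curl command above can be sketched in Python with only the standard library. The URL is taken from that command; treating its path segments as category, owner, and repository, and expecting a JSON response body, are assumptions. The function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; everything after it is assumed to be
# <category>/<owner>/<repo>.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Build the endpoint URL, percent-encoding each path segment."""
    segments = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(segments)


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch the record for one repository (assumes the endpoint returns JSON)."""
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Performs a real network request and counts against the daily limit.
    data = fetch_quality("rag", "Huang-lab", "figure-extractor")
    print(json.dumps(data, indent=2))
```

Separating URL construction from the network call keeps the request logic easy to check without spending any of the daily quota.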
Higher-rated alternatives
kreuzberg-dev/kreuzberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and...
PaddlePaddle/PaddleOCR
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR...
yfedoseev/pdf_oxide
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown...
opendataloader-project/opendataloader-pdf
PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
AKSarav/pdfstract
PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline - Available as CLI - WEBUI - API