aphp/edspdf
EDS-PDF is a generic, pure-Python framework for text extraction from PDF documents. It provides the machinery to use rule-based or machine-learning-based approaches to classify text blocks as either body or metadata.
This tool helps you pull text content from PDF documents, even when their layouts mix body text with metadata. It takes a PDF file as input and outputs the extracted text, separated into categories such as body content and metadata. This suits researchers, data analysts, or anyone who regularly needs to process information locked in a large volume of PDFs.
No commits in the last 6 months. Available on PyPI.
Use this if you need to reliably extract specific types of text from a collection of PDFs, distinguishing between main content and other elements like headers, footers, or marginal notes.
Not ideal if you only need a basic, undifferentiated text dump from simple PDFs, as its advanced classification features might be overkill.
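The rule-based side of that body/metadata split can be illustrated with a minimal sketch. This is not EDS-PDF's API: `TextBlock`, `Mask`, `overlap_ratio`, and `classify` are hypothetical names, and the sketch assumes each text block comes with a bounding box in relative page coordinates (0 to 1, origin at the top-left). A block is labelled "body" when enough of its area falls inside a rectangular mask, and "meta" otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    """Hypothetical text block with a bounding box in relative coordinates."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Mask:
    """Hypothetical rectangle marking where body text is expected."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_ratio(block: TextBlock, mask: Mask) -> float:
    """Return the fraction of the block's area that lies inside the mask."""
    width = max(0.0, min(block.x1, mask.x1) - max(block.x0, mask.x0))
    height = max(0.0, min(block.y1, mask.y1) - max(block.y0, mask.y0))
    area = (block.x1 - block.x0) * (block.y1 - block.y0)
    return (width * height) / area if area > 0 else 0.0


def classify(blocks, mask, threshold=0.5):
    """Group block texts under 'body' or 'meta' using the mask rule."""
    labelled = {"body": [], "meta": []}
    for block in blocks:
        label = "body" if overlap_ratio(block, mask) >= threshold else "meta"
        labelled[label].append(block.text)
    return labelled


# Illustrative page: a letterhead, one paragraph, and a page footer.
blocks = [
    TextBlock("Hospital letterhead", 0.10, 0.02, 0.90, 0.08),
    TextBlock("The patient was admitted on Monday.", 0.10, 0.20, 0.90, 0.50),
    TextBlock("Page 1/3", 0.45, 0.95, 0.55, 0.98),
]
body_mask = Mask(0.05, 0.10, 0.95, 0.90)
result = classify(blocks, body_mask)
```

Here only the paragraph overlaps the mask, so it ends up under "body", while the letterhead and the page footer fall under "meta". A machine-learning approach would replace the fixed mask with a trained classifier over the same kind of block features.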
Stars: 62
Forks: 7
Language: Python
License: BSD-3-Clause
Category
Last pushed: Feb 12, 2025
Commits (30d): 0
Dependencies: 22
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/aphp/edspdf"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
Higher-rated alternatives
paperless-ngx/paperless-ngx
A community-supported supercharged document management system: scan, index and archive all your documents
GoogleCloudPlatform/document-ai-samples
Sample applications and demos for Document AI, the end-to-end document processing platform on...
aws-solutions/document-understanding-solution
Example of integrating & using Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical,...
naiveHobo/InvoiceNet
Deep neural network to extract intelligent information from invoice documents.
ptmrio/autorename-pdf
autorename-pdf is a highly efficient tool designed to automatically rename and archive PDF...