StabRise/ScaleDP
ScaleDP is an open-source extension of Apache Spark for document processing.
This tool helps data engineers and AI/ML practitioners extract structured information from large volumes of unstructured documents such as PDFs and images. It takes raw documents as input and uses AI models to pull specific data fields (such as invoice details or patient information) into a structured table. It suits anyone who needs to automate data extraction at scale across diverse document types.
Available on PyPI.
Use this if you need to process vast collections of documents (PDFs, images) to extract specific, structured data using AI, and require the scalability of Apache Spark.
Not ideal if you only need to process a few documents or do not already run Apache Spark infrastructure.
Stars: 18
Forks: 1
Language: Python
License: AGPL-3.0
Category
Last pushed: Dec 02, 2025
Commits (30d): 0
Dependencies: 22
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/StabRise/ScaleDP"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
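The curl request above can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names `build_quality_url` and `fetch_quality` are hypothetical, and the code assumes the endpoint returns a JSON body, which this page does not confirm. No API key is sent, so the 100 requests/day limit would apply.

```python
import json
import urllib.parse
import urllib.request

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (hypothetical helper)."""
    # Percent-encode each path segment so unusual characters stay inside it.
    parts = [urllib.parse.quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for one repository.

    Assumption: the response body is JSON. If it is not, json.loads raises
    a ValueError, which the caller should handle.
    """
    url = build_quality_url(category, owner, repo)
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (performs a network request, so it is left commented out):
# print(json.dumps(fetch_quality("ml-frameworks", "StabRise", "ScaleDP"), indent=2))
```

Keeping URL construction separate from the network call makes the request easy to inspect or log before it counts against the daily quota.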
Higher-rated alternatives
paperless-ngx/paperless-ngx
A community-supported supercharged document management system: scan, index and archive all your documents
GoogleCloudPlatform/document-ai-samples
Sample applications and demos for Document AI, the end-to-end document processing platform on...
aws-solutions/document-understanding-solution
Example of integrating & using Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical,...
naiveHobo/InvoiceNet
Deep neural network to extract intelligent information from invoice documents.
aphp/edspdf
EDS-PDF is a generic, pure-Python framework for text extraction from PDF documents. It provides...