StabRise/ScaleDP

ScaleDP is an Open-Source extension of Apache Spark for Document Processing

Emerging

This tool helps data engineers and AI/ML practitioners extract structured information from large volumes of unstructured documents like PDFs and images. It takes raw documents as input and uses AI models to pull out specific data fields (like invoice details or patient info) into a structured table format. This is ideal for anyone needing to automate data extraction at scale from diverse document types.
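The input-to-table flow described above can be illustrated with a minimal sketch. This is not ScaleDP's API: the names `InvoiceRow`, `extract_fields`, and `documents_to_table` are hypothetical, two regular expressions stand in for the AI model, and a plain Python list of dicts stands in for a Spark DataFrame. It only shows the shape of the transformation, from raw document text to one structured row per document.

```python
import re
from dataclasses import dataclass, asdict
from typing import Dict, List, Optional


@dataclass
class InvoiceRow:
    """One structured output row per input document (hypothetical schema)."""
    source: str
    invoice_number: Optional[str]
    total: Optional[float]


# Assumption: simple regexes stand in for the AI extraction model.
_NUMBER = re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\S+)", re.IGNORECASE)
_TOTAL = re.compile(r"Total\s*:?\s*\$?\s*([0-9]+(?:\.[0-9]{2})?)", re.IGNORECASE)


def extract_fields(source: str, text: str) -> InvoiceRow:
    """Pull the target fields out of one document's text; missing fields become None."""
    number = _NUMBER.search(text)
    total = _TOTAL.search(text)
    return InvoiceRow(
        source=source,
        invoice_number=number.group(1) if number else None,
        total=float(total.group(1)) if total else None,
    )


def documents_to_table(documents: Dict[str, str]) -> List[dict]:
    """Map every document to a row, yielding a table-like list of dicts."""
    return [asdict(extract_fields(name, text)) for name, text in documents.items()]
```

In a Spark setting the same per-document function would be applied across partitions rather than in a local loop, which is where the scalability comes from.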

Available on PyPI.

Use this if you need to process vast collections of documents (PDFs, images) to extract specific, structured data using AI, and require the scalability of Apache Spark.

Not ideal if you only need to process a few documents or don't have an existing Apache Spark infrastructure.

document-processing data-extraction invoice-processing forms-automation information-retrieval

Maintenance 6 / 25

Adoption 6 / 25

Maturity 25 / 25

Community 5 / 25

Language: Python

License: AGPL-3.0

Higher-rated alternatives

paperless-ngx/paperless-ngx

A community-supported supercharged document management system: scan, index and archive all your documents

GoogleCloudPlatform/document-ai-samples

Sample applications and demos for Document AI, the end-to-end document processing platform on...

aws-solutions/document-understanding-solution

Example of integrating & using Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical,...

naiveHobo/InvoiceNet

Deep neural network to extract intelligent information from invoice documents.

aphp/edspdf

EDS-PDF is a generic, pure-Python framework for text extraction from PDF documents. It provides...
