StabRise/ScaleDP

ScaleDP is an open-source extension of Apache Spark for document processing.

Score: 42 / 100 (Emerging)

This tool helps data engineers and AI/ML practitioners extract structured information from large volumes of unstructured documents such as PDFs and images. It takes raw documents as input and uses AI models to pull specific data fields (for example, invoice details or patient information) into a structured table. It suits anyone who needs to automate data extraction at scale across diverse document types.

Available on PyPI.

Use this if you need to process vast collections of documents (PDFs, images) to extract specific, structured data using AI, and require the scalability of Apache Spark.

Not ideal if you only need to process a few documents or do not already run Apache Spark.

Tags: document-processing, data-extraction, invoice-processing, forms-automation, information-retrieval
Maintenance 6 / 25
Adoption 6 / 25
Maturity 25 / 25
Community 5 / 25
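The four category scores above add up to the overall 42 / 100. The short check below assumes the overall score is a plain sum of the four categories, each capped at 25; the page does not state the formula, so treat this as an illustration rather than the site's actual method.

```python
# Category scores as listed on this page (each out of 25).
category_scores = {
    "Maintenance": 6,
    "Adoption": 6,
    "Maturity": 25,
    "Community": 5,
}

MAX_PER_CATEGORY = 25  # assumption: every category shares the same cap

# Assumed aggregation: a simple unweighted sum of the categories.
total = sum(category_scores.values())
max_total = MAX_PER_CATEGORY * len(category_scores)

print(f"{total} / {max_total}")
```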


Stars: 18
Forks: 1
Language: Python
License: AGPL-3.0
Last pushed: Dec 02, 2025
Commits (30d): 0
Dependencies: 22

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/StabRise/ScaleDP"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
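The same request can be made from Python using only the standard library. This is a minimal sketch: the helper names `build_url` and `fetch_quality` are hypothetical, the path layout (category, owner, repository) is inferred from the single URL shown above, and the response is assumed to be JSON, which the page does not confirm.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL; the segment order is inferred from the example."""
    segments = (category, owner, repo)
    return BASE_URL + "/" + "/".join(quote(s, safe="") for s in segments)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch quality data for one repository (assumes a JSON response body)."""
    request = urllib.request.Request(
        build_url(category, owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("ml-frameworks", "StabRise", "ScaleDP")` issues the same GET request as the curl command. Without a key, requests count against the 100 per day limit, so it is worth caching the result rather than polling.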