getyourguide/DDataFlow

A tool to help you test and develop PySpark code with sampled and local data

46 / 100 (Emerging)

When building machine learning models or data pipelines using PySpark, this tool helps you develop and test your code more efficiently. It takes your full production data sources, samples them down for faster processing, and outputs results to a test location, preventing any accidental changes to live data. Data scientists and data engineers working with PySpark will find this useful for their daily development and testing workflows.
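The sample-then-redirect idea described above can be illustrated with a small, library-free sketch. This is not DDataFlow's API: the class `DevDataFlow`, its fields, and the `/tmp/dev_output` prefix are hypothetical names chosen for illustration, and plain Python lists stand in for Spark DataFrames (in PySpark the comparable sampling call would be `DataFrame.sample`).

```python
import random
from dataclasses import dataclass


@dataclass
class DevDataFlow:
    """Hypothetical helper illustrating the concept; NOT DDataFlow's real API."""

    enabled: bool                         # True during development and testing
    sample_fraction: float = 0.1          # assumed share of rows to keep
    test_prefix: str = "/tmp/dev_output"  # assumed test location
    seed: int = 42                        # fixed seed keeps the sample reproducible

    def source(self, rows):
        """Return all rows when disabled, otherwise a reproducible sample."""
        if not self.enabled:
            return list(rows)
        rng = random.Random(self.seed)
        return [row for row in rows if rng.random() < self.sample_fraction]

    def output_path(self, production_path):
        """Redirect writes to the test location while enabled."""
        if not self.enabled:
            return production_path
        return f"{self.test_prefix}/{production_path.lstrip('/')}"


flow = DevDataFlow(enabled=True)
events = [{"id": i} for i in range(1000)]

sample = flow.source(events)                     # far fewer rows than production
target = flow.output_path("/warehouse/events_agg")
print(len(sample), target)
```

Keeping the switch in a single `enabled` flag means the pipeline code itself stays identical between development and production runs; only the data volume and the write target change.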

Available on PyPI.

Use this if you are a data scientist or engineer building PySpark-based machine learning or data pipelines and need a way to develop and test your code quickly and safely with realistic, sampled data.

Not ideal if you need to run tests against your full production dataset or if you are not working with PySpark for your data pipelines.

data-engineering machine-learning-engineering pyspark-development data-pipeline-testing ml-workflow-testing
Maintenance 10 / 25
Adoption 6 / 25
Maturity 25 / 25
Community 5 / 25

Stars 15
Forks 1
Language HTML
License Apache-2.0
Last pushed Feb 05, 2026
Commits (30d) 0
Dependencies 5

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/getyourguide/DDataFlow"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
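The same request can be made from Python using only the standard library. The sketch below rests on two assumptions: the path layout `quality/<category>/<owner>/<repo>` is inferred from the single URL shown above, and the response format is not documented here, so the body is returned as raw text rather than parsed. The function names `quality_url` and `fetch_quality` are illustrative.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL (path layout assumed from the curl example)."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return f"{BASE_URL}/" + "/".join(parts)


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch the quality data and return the raw response body as text."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


url = quality_url("ml-frameworks", "getyourguide", "DDataFlow")
print(url)
# Network call, counts against the daily request limit:
# print(fetch_quality("ml-frameworks", "getyourguide", "DDataFlow"))
```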