getyourguide/DDataFlow

A tool to help you test and develop PySpark code with sampled and local data

Emerging

When building machine learning models or data pipelines using PySpark, this tool helps you develop and test your code more efficiently. It takes your full production data sources, samples them down for faster processing, and outputs results to a test location, preventing any accidental changes to live data. Data scientists and data engineers working with PySpark will find this useful for their daily development and testing workflows.
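The sample-and-redirect workflow described above can be illustrated with a minimal sketch in plain Python. It is not DDataFlow's actual API: the class `SampledFlow`, its fields, and the `/tmp/sampled_flow` prefix are hypothetical names chosen for illustration, and rows are plain dictionaries rather than Spark DataFrames. In real PySpark code the sampling step would typically rely on `DataFrame.sample(fraction=..., seed=...)`.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = dict  # stand-in for a DataFrame row; an assumption of this sketch


@dataclass
class SampledFlow:
    """Hypothetical helper illustrating the idea; not DDataFlow's real interface."""

    sources: Dict[str, Callable[[], List[Row]]]  # name -> loader of full data
    enabled: bool = False                        # False means production behaviour
    fraction: float = 0.1                        # approximate share of rows to keep
    seed: int = 42                               # fixed seed for repeatable samples
    test_prefix: str = "/tmp/sampled_flow"       # hypothetical test output location

    def source(self, name: str) -> List[Row]:
        """Return the full source, or a seeded random sample when enabled."""
        rows = self.sources[name]()
        if not self.enabled:
            return rows
        rng = random.Random(self.seed)
        return [row for row in rows if rng.random() < self.fraction]

    def path(self, production_path: str) -> str:
        """Return the production path, or a path under the test prefix when enabled."""
        if not self.enabled:
            return production_path
        return f"{self.test_prefix}/{production_path.lstrip('/')}"


def load_events() -> List[Row]:
    # Stand-in for reading a large production table.
    return [{"id": i} for i in range(1000)]


# Development run: inputs are sampled, outputs are redirected.
dev_flow = SampledFlow(sources={"events": load_events}, enabled=True)
dev_rows = dev_flow.source("events")
dev_out = dev_flow.path("/warehouse/events_clean")

# Production run: the same pipeline code sees full data and the real path.
prod_flow = SampledFlow(sources={"events": load_events}, enabled=False)
prod_rows = prod_flow.source("events")
prod_out = prod_flow.path("/warehouse/events_clean")
```

The point of the design is that pipeline code calls `source` and `path` the same way in both modes, so switching between sampled development runs and full production runs is a single flag rather than a code change.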

Available on PyPI.

Use this if you are a data scientist or engineer building PySpark-based machine learning or data pipelines and need a way to develop and test your code quickly and safely with realistic, sampled data.

Not ideal if you need to run tests against your full production dataset or if you are not working with PySpark for your data pipelines.

data-engineering machine-learning-engineering pyspark-development data-pipeline-testing ml-workflow-testing

Maintenance 10 / 25

Adoption 6 / 25

Maturity 25 / 25

Community 5 / 25

Language: HTML

License: Apache-2.0

Higher-rated alternatives

scverse/anndata

Annotated data.

koaning/scikit-lego

Extra blocks for scikit-learn pipelines.

googleapis/python-bigquery-dataframes

BigQuery DataFrames (also known as BigFrames)

bigmlcom/python

Python bindings for BigML.io

posit-dev/orbital

Turn SciKitLearn pipelines into SQL
