uber/petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.

Established

This project helps machine learning engineers efficiently train and evaluate deep learning models using large datasets stored in Apache Parquet format. It takes raw data in Parquet files and outputs data ready for training models in frameworks like TensorFlow or PyTorch. This is for machine learning practitioners who work with large-scale data and need to feed it into their deep learning workflows.
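As a rough illustration of that Parquet-in, training-data-out flow, the sketch below streams an existing Parquet store in batches using Petastorm's `make_batch_reader`. The helper names `to_dataset_url` and `count_rows` are hypothetical and exist only for this example, and any dataset path you pass in is a placeholder. It assumes `petastorm` is installed from PyPI and that the directory already contains Parquet files.

```python
from pathlib import Path


def to_dataset_url(path):
    """Turn a local path into the file:// URL form that Petastorm readers take.

    Hypothetical helper for this sketch, not part of Petastorm itself.
    """
    return Path(path).resolve().as_uri()


def count_rows(dataset_url):
    """Stream a plain Parquet store batch by batch and count its rows.

    Hypothetical helper; a real training loop would feed each batch to a model
    instead of counting.
    """
    # Imported lazily so this module still loads where petastorm is not installed.
    from petastorm import make_batch_reader

    total = 0
    # num_epochs=1 makes the reader stop after a single pass over the data.
    with make_batch_reader(dataset_url, num_epochs=1) as reader:
        for batch in reader:
            # Each batch is a namedtuple with one NumPy array per column.
            total += len(batch[0])
    return total
```

For framework integration, Petastorm also ships adapters such as `petastorm.pytorch.DataLoader` and `petastorm.tf_utils.make_petastorm_dataset`, which wrap a reader like the one above; check the project's documentation for the exact options supported by your installed version.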

1,880 stars. Available on PyPI.

Use this if you are a machine learning engineer building deep learning models and need to ingest large datasets stored in Apache Parquet format for training or evaluation.

Not ideal if your deep learning workflows don't involve Apache Parquet data or if you are not working with Python-based machine learning frameworks.

deep-learning machine-learning-engineering data-ingestion model-training big-data-ml

Maintenance 6 / 25

Adoption 10 / 25

Maturity 25 / 25

Community 23 / 25

Stars 1,880

Forks 286

Language Python

License Apache-2.0

Related frameworks

treeverse/dvc

🦉 Data Versioning and ML Experiments

runpod/runpod-python

🐍 | Python library for RunPod API and serverless worker SDK.

microsoft/vscode-jupyter

VS Code Jupyter extension

4paradigm/OpenMLDB

OpenMLDB is an open-source machine learning database that provides a feature platform computing...

nuhame/mlpug

MLPug is a library for training and evaluating Machine Learning (ML) models, able to use...
