uber/petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.

64 / 100 (Established)

This project helps machine learning engineers efficiently train and evaluate deep learning models using large datasets stored in Apache Parquet format. It takes raw data in Parquet files and outputs data ready for training models in frameworks like TensorFlow or PyTorch. This is for machine learning practitioners who work with large-scale data and need to feed it into their deep learning workflows.
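As a rough illustration of that Parquet-to-training-data flow, the sketch below reads a few batches from a plain Parquet store with Petastorm's `make_batch_reader`. The helper names `to_dataset_url` and `iter_parquet_batches` and the `max_batches` parameter are hypothetical, introduced only for this example; the sketch assumes `petastorm` is installed and that the store is a local directory of Parquet files.

```python
from pathlib import Path


def to_dataset_url(path):
    """Turn a local directory path into the file:// URL form that Petastorm readers take.

    Hypothetical helper for this sketch, not part of Petastorm.
    """
    return Path(path).resolve().as_uri()


def iter_parquet_batches(dataset_url, max_batches=1):
    """Yield up to max_batches row-group batches from a plain Parquet store."""
    # Imported lazily so this module still loads where petastorm is not installed.
    from petastorm import make_batch_reader

    # num_epochs=1 makes the reader stop after one pass over the data.
    with make_batch_reader(dataset_url, num_epochs=1) as reader:
        for i, batch in enumerate(reader):
            if i >= max_batches:
                break
            # Each batch is a namedtuple holding one NumPy array per column.
            yield batch
```

A call such as `next(iter_parquet_batches(to_dataset_url("/data/my_parquet_dir")))` would return the first batch, where the directory path is a placeholder. For feeding a framework directly, Petastorm also provides `petastorm.pytorch.DataLoader` and `petastorm.tf_utils.make_petastorm_dataset`, which wrap a reader like the one opened here.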

1,880 stars. Available on PyPI.

Use this if you are a machine learning engineer building deep learning models and need to ingest large datasets stored in Apache Parquet format for training or evaluation.

Not ideal if your deep learning workflows don't involve Apache Parquet data or if you are not working with Python-based machine learning frameworks.

deep-learning machine-learning-engineering data-ingestion model-training big-data-ml
Maintenance 6 / 25
Adoption 10 / 25
Maturity 25 / 25
Community 23 / 25


Stars 1,880
Forks 286
Language Python
License Apache-2.0
Last pushed Jan 02, 2026
Commits (30d) 0
Dependencies 13

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/uber/petastorm"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
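The same request can be made from Python using only the standard library. This is a minimal sketch under two assumptions that the page does not confirm: that the URL path follows a category/owner/repo pattern (inferred from the single curl example) and that the endpoint returns JSON. The function names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL; the category/owner/repo layout is an assumption."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch the quality record, assuming the response body is JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("ml-frameworks", "uber", "petastorm")` issues the same GET request as the curl command and counts against the same daily request limit.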