uber/petastorm
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
This project helps machine learning engineers efficiently train and evaluate deep learning models using large datasets stored in Apache Parquet format. It reads raw data from Parquet files and delivers it in a form ready for training in frameworks like TensorFlow or PyTorch. It is aimed at machine learning practitioners who work with large-scale data and need to feed it into their deep learning workflows.
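As a rough illustration of the Parquet-to-training-loop flow described above, the following minimal sketch reads a plain Parquet directory with Petastorm's `make_batch_reader` and counts the rows it yields. The helper names `to_dataset_url` and `count_rows` and the example path are hypothetical, and the sketch assumes `petastorm` is installed and that the directory holds ordinary Parquet files; in real training code the loop body would hand each batch to the framework instead of counting.

```python
from pathlib import Path


def to_dataset_url(path: str) -> str:
    """Turn a local filesystem path into the file:// URL form Petastorm expects.

    Hypothetical helper for this sketch; remote stores (e.g. hdfs://, s3://)
    would be passed as URLs directly.
    """
    return Path(path).absolute().as_uri()


def count_rows(dataset_url: str) -> int:
    """Iterate once over a plain Parquet dataset and count its rows.

    Assumes petastorm is installed. The import is deferred so that the
    helper above stays usable without it.
    """
    from petastorm import make_batch_reader

    total = 0
    # make_batch_reader yields namedtuples whose fields are NumPy arrays,
    # one array per Parquet column, covering a batch of rows.
    with make_batch_reader(dataset_url, num_epochs=1) as reader:
        for batch in reader:
            first_column = batch[0]
            total += len(first_column)
    return total
```

For example, `count_rows(to_dataset_url("/data/my_parquet_dir"))` would scan a hypothetical local dataset once. Datasets written with Petastorm's own schema metadata are read with `make_reader` instead, which yields one row at a time.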
1,880 stars. Available on PyPI.
Use this if you are a machine learning engineer building deep learning models and need to ingest large datasets stored in Apache Parquet format for training or evaluation.
Not ideal if your deep learning workflows don't involve Apache Parquet data or if you are not working with Python-based machine learning frameworks.
Stars: 1,880
Forks: 286
Language: Python
License: Apache-2.0
Category:
Last pushed: Jan 02, 2026
Commits (30d): 0
Dependencies: 13
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/uber/petastorm"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
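The same request can be made from Python using only the standard library. This is a minimal sketch: the endpoint path is taken from the curl command above, while the function names are hypothetical and the assumption that the response body is JSON is not confirmed by this page.

```python
import json
import urllib.request

# Endpoint prefix copied from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks"


def build_url(owner: str, repo: str) -> str:
    """Build the quality endpoint URL for an owner/repo pair."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for a repository.

    Assumption: the API returns a JSON document; the schema is not
    documented here, so the parsed value is returned as-is.
    """
    with urllib.request.urlopen(build_url(owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("uber", "petastorm")` issues one request, which counts toward the daily limit.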
Related frameworks
treeverse/dvc
🦉 Data Versioning and ML Experiments
runpod/runpod-python
🐍 | Python library for RunPod API and serverless worker SDK.
microsoft/vscode-jupyter
VS Code Jupyter extension
4paradigm/OpenMLDB
OpenMLDB is an open-source machine learning database that provides a feature platform computing...
nuhame/mlpug
MLPug is a library for training and evaluating Machine Learning (ML) models, able to use...