mosaicml/streaming

A Data Streaming Library for Efficient Neural Network Training

Score: 57 / 100 (Established)

This library helps machine learning engineers train large neural networks efficiently on datasets stored in cloud object storage such as AWS S3 or Google Cloud Storage. Raw data (images, text, video) is first converted into a supported shard format such as MDS, CSV, or JSONL, and is then streamed directly into PyTorch training workflows. Because shards are fetched as training needs them rather than downloaded in full up front, training can start sooner and scale more easily, especially for large, distributed models.

Use this if you are a machine learning engineer training large models with datasets stored in cloud object storage and need to improve training speed and scalability.

Not ideal if you are a data scientist working primarily with small datasets on a local machine, or if you do not use PyTorch for neural network training.

deep-learning-training cloud-mlops distributed-ml large-scale-data-processing
No Package | No Dependents
Maintenance 10 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 21 / 25
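
The four category scores above appear to add up to the overall score. The page does not state the scoring method, so treating the total as a plain sum is an assumption that matches the numbers shown. A minimal sketch of that arithmetic:

```python
# Category scores as listed in the breakdown above: (points, maximum).
category_scores = {
    "Maintenance": (10, 25),
    "Adoption": (10, 25),
    "Maturity": (16, 25),
    "Community": (21, 25),
}

# Assumption: the overall score is the plain sum of the category points.
total = sum(points for points, _ in category_scores.values())
maximum = sum(cap for _, cap in category_scores.values())

print(f"{total} / {maximum}")  # prints "57 / 100"
```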

Stars: 1,472
Forks: 189
Language: Python
License: Apache-2.0
Last pushed: Feb 02, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/mosaicml/streaming"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
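
The same request can be made from Python with only the standard library. This is a sketch under stated assumptions: the function names `build_quality_url` and `fetch_quality` are hypothetical, the reading of the three path segments as category, owner, and repository is inferred from the curl example, and the response is assumed to be JSON, which this page does not confirm.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL, following the pattern of the curl example.

    Assumption: the path segments are <category>/<owner>/<repo>.
    """
    parts = (quote(part, safe="") for part in (category, owner, repo))
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record and decode it. Assumes the response body is JSON."""
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_quality_url("ml-frameworks", "mosaicml", "streaming")
print(url)

# Uncomment to perform the network request (counts against the daily limit):
# data = fetch_quality("ml-frameworks", "mosaicml", "streaming")
# print(json.dumps(data, indent=2))
```

Keeping URL construction separate from the network call makes the pattern easy to check offline and to reuse for other repositories.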