marian-nmt/sotastream

A library for data streaming and augmentation

Score: 44 / 100 (Emerging)

This tool helps machine translation researchers and developers prepare large datasets for training. It reads raw, often compressed, text data and generates an endless stream of shuffled and augmented examples that a training pipeline can consume directly.
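The idea described above can be sketched in plain Python. This is a minimal illustration of the general technique (looping over compressed files, shuffling within a bounded buffer, and applying an augmentation on the fly), not sotastream's actual API; the names `read_lines`, `endless_stream`, and `maybe_lowercase` are hypothetical and exist only for this example.

```python
import gzip
import random
from typing import Callable, Iterator, List


def read_lines(paths: List[str]) -> Iterator[str]:
    """Yield lines (without the trailing newline) from gzip-compressed text files."""
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")


def endless_stream(
    paths: List[str],
    augment: Callable[[str, random.Random], str],
    buffer_size: int = 1000,
    seed: int = 0,
) -> Iterator[str]:
    """Loop over the files forever, shuffling within a bounded buffer.

    Hypothetical helper: assumes the files together contain at least one line.
    """
    if not paths:
        raise ValueError("at least one input file is required")
    rng = random.Random(seed)
    buffer: List[str] = []
    while True:  # each iteration is one full pass over the data
        order = list(paths)
        rng.shuffle(order)  # visit the files in a new order on every pass
        for line in read_lines(order):
            buffer.append(line)
            if len(buffer) >= buffer_size:
                # Move a randomly chosen line to the end, then emit it.
                idx = rng.randrange(len(buffer))
                buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
                yield augment(buffer.pop(), rng)


def maybe_lowercase(line: str, rng: random.Random) -> str:
    """Toy augmentation: lowercase roughly half of the lines."""
    return line.lower() if rng.random() < 0.5 else line
```

Because the generator never terminates, a consumer would take a bounded slice, for example `itertools.islice(endless_stream(paths, maybe_lowercase), 10_000)`. The buffer size trades memory for shuffle quality: a larger buffer mixes lines from more distant parts of the input.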

No commits in the last 6 months. Available on PyPI.

Use this if you are training machine translation models and need to efficiently prepare, augment, and stream large text datasets for your training pipelines.

Not ideal if your primary goal is general-purpose data processing, or augmentation for domains outside machine translation and other NLP tasks that rely on text stream manipulation.

Tags: machine-translation, NLP-training, text-data-augmentation, language-model-training
Stale 6m
Maintenance 2 / 25
Adoption 6 / 25
Maturity 25 / 25
Community 11 / 25
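
The scoring formula is not stated here, but the four sub-scores listed above are consistent with a simple sum producing the overall score of 44 / 100. A quick check, assuming the total is just the sum of the four components:

```python
# Sub-scores as listed on this page, each out of 25.
subscores = {"Maintenance": 2, "Adoption": 6, "Maturity": 25, "Community": 11}

# Assumption: the overall score is the plain sum of the four components.
total = sum(subscores.values())
maximum = 25 * len(subscores)
print(f"{total} / {maximum}")  # prints "44 / 100"
```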


Stars: 21
Forks: 3
Language: Python
License: MIT
Last pushed: May 05, 2025
Commits (30d): 0
Dependencies: 4

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/marian-nmt/sotastream"
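
The same request can be made from Python using only the standard library. This is a sketch under two assumptions that the page does not confirm: that the path segments after `/quality/` are a category, an owner, and a repository name, and that the endpoint returns JSON. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (assumed layout: category/owner/repo)."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record; assumes the response body is JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("ml-frameworks", "marian-nmt", "sotastream")` would request the same URL as the curl command above; the structure of the returned data is not documented on this page, so inspect it before relying on specific fields.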

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.