marian-nmt/sotastream
A library for data streaming and augmentation
This tool helps machine translation researchers and developers prepare large datasets for training. It takes raw, often compressed, text data and turns it into an endless stream of shuffled and augmented examples that a training pipeline can consume efficiently.
No commits in the last 6 months. Available on PyPI.
Use this if you are training machine translation models and need to efficiently prepare, augment, and stream large text datasets for your training pipelines.
Not ideal if your primary goal is general-purpose data processing or augmentation outside of machine translation and similar NLP tasks that rely on text streams.
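The idea described above (compressed text in, an endless shuffled and augmented stream out) can be illustrated with a minimal standard-library sketch. This is not sotastream's API: the function names `read_lines_forever`, `shuffle_buffer`, `augment`, and `stream`, the gzip input format, the buffer-based shuffle, and the uppercasing augmentation are all assumptions chosen for illustration.

```python
import gzip
import random
from typing import Callable, Iterator, List


def read_lines_forever(paths: List[str], rng: random.Random) -> Iterator[str]:
    """Yield lines from gzip-compressed text files, cycling over the files endlessly.

    Assumes at least one file is non-empty; otherwise the loop never yields.
    """
    if not paths:
        raise ValueError("at least one input file is required")
    while True:
        order = list(paths)
        rng.shuffle(order)  # visit the files in a fresh random order on each pass
        for path in order:
            with gzip.open(path, "rt", encoding="utf-8") as handle:
                for line in handle:
                    yield line.rstrip("\n")


def shuffle_buffer(lines: Iterator[str], size: int, rng: random.Random) -> Iterator[str]:
    """Approximately shuffle a stream using a fixed-size buffer."""
    buffer: List[str] = []
    for line in lines:
        if len(buffer) < size:
            buffer.append(line)  # fill the buffer before emitting anything
            continue
        index = rng.randrange(size)
        yield buffer[index]      # emit a random buffered line...
        buffer[index] = line     # ...and put the new line in its slot
    rng.shuffle(buffer)          # only reached when the input is finite
    yield from buffer


def augment(
    lines: Iterator[str],
    transform: Callable[[str], str],
    probability: float,
    rng: random.Random,
) -> Iterator[str]:
    """Apply `transform` to each line with the given probability."""
    for line in lines:
        yield transform(line) if rng.random() < probability else line


def stream(paths: List[str], buffer_size: int = 1000, seed: int = 0) -> Iterator[str]:
    """Compose reading, shuffling, and augmentation into one endless iterator."""
    rng = random.Random(seed)
    lines = read_lines_forever(paths, rng)
    lines = shuffle_buffer(lines, buffer_size, rng)
    # Uppercasing is a placeholder augmentation (hypothetical choice).
    return augment(lines, str.upper, 0.1, rng)
```

Because the result never ends, a consumer would take a bounded slice, for example with `itertools.islice(stream(paths), n)`, or read from it for as long as training runs. A buffer-based shuffle trades shuffle quality for constant memory, which is one common way to handle corpora too large to load at once.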
Stars: 21
Forks: 3
Language: Python
License: MIT
Category:
Last pushed: May 05, 2025
Commits (30d): 0
Dependencies: 4
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/marian-nmt/sotastream"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
Higher-rated alternatives
kermitt2/delft
a Deep Learning Framework for Text https://delft.readthedocs.io/
yoeo/guesslang
Detect the programming language of a source code
matthewdeanmartin/whats_that_code
detect programming language of source in pure python from an ensemble of classifiers
airalcorn2/Deep-Semantic-Similarity-Model
My Keras implementation of the Deep Semantic Similarity Model (DSSM)/Convolutional Latent...
christiansafka/img2vec
:fire: Use pre-trained models in PyTorch to extract vector embeddings for any image