marian-nmt/sotastream

A library for data streaming and augmentation

Emerging

This tool helps machine translation researchers and developers prepare large datasets for training. It reads raw, often compressed, text data and produces an endless stream of shuffled and augmented examples, which supports efficient training of machine translation models.
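The idea of turning compressed text files into an endless, shuffled, augmented stream can be sketched with plain Python generators. This is a minimal illustration using only the standard library and is not sotastream's own API: the function names (`read_forever`, `shuffle_buffer`, `uppercase_some`, `make_stream`), the gzip input format, and the uppercasing augmentation are all assumptions chosen for the example.

```python
import gzip
import random
from typing import Iterator, List


def read_forever(paths: List[str], rng: random.Random) -> Iterator[str]:
    """Yield lines from gzip-compressed text files, cycling over them endlessly."""
    while True:
        order = paths[:]
        rng.shuffle(order)  # visit the files in a new order on every pass
        for path in order:
            with gzip.open(path, "rt", encoding="utf-8") as handle:
                for line in handle:
                    yield line.rstrip("\n")


def shuffle_buffer(lines: Iterator[str], size: int, rng: random.Random) -> Iterator[str]:
    """Approximate shuffle: keep `size` lines and swap a random one out per new line."""
    buffer: List[str] = []
    for line in lines:
        if len(buffer) < size:
            buffer.append(line)  # still filling the buffer, emit nothing yet
            continue
        index = rng.randrange(size)
        buffer[index], line = line, buffer[index]
        yield line
    # Only reached if the input ends; flush whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def uppercase_some(lines: Iterator[str], prob: float, rng: random.Random) -> Iterator[str]:
    """Hypothetical augmentation: uppercase a line with probability `prob`."""
    for line in lines:
        yield line.upper() if rng.random() < prob else line


def make_stream(paths: List[str], seed: int = 0, buffer_size: int = 1000,
                upper_prob: float = 0.1) -> Iterator[str]:
    """Chain the stages: endless reader -> shuffle buffer -> augmentation."""
    if not paths:
        raise ValueError("at least one input file is required")
    rng = random.Random(seed)
    lines = read_forever(paths, rng)
    lines = shuffle_buffer(lines, buffer_size, rng)
    return uppercase_some(lines, upper_prob, rng)
```

Because the stream never ends, a consumer takes as many lines as it needs, for example with `itertools.islice(make_stream(["corpus.tsv.gz"]), 1000)`, where `corpus.tsv.gz` is a placeholder file name. The buffer-based shuffle trades perfect randomness for constant memory, which is the usual compromise when the data does not fit in RAM.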

No commits in the last 6 months. Available on PyPI.

Use this if you are training machine translation models and need to efficiently prepare, augment, and stream large text datasets for your training pipelines.

Not ideal if you need general-purpose data processing or augmentation outside machine translation and similar NLP tasks that work on text streams.

machine-translation NLP-training text-data-augmentation language-model-training

Stale 6m

Maintenance 2 / 25

Adoption 6 / 25

Maturity 25 / 25

Community 11 / 25

Language

Python

License

MIT

Higher-rated alternatives

kermitt2/delft

a Deep Learning Framework for Text https://delft.readthedocs.io/

yoeo/guesslang

Detect the programming language of a source code

matthewdeanmartin/whats_that_code

detect programming language of source in pure python from an ensemble of classifiers

airalcorn2/Deep-Semantic-Similarity-Model

My Keras implementation of the Deep Semantic Similarity Model (DSSM)/Convolutional Latent...

christiansafka/img2vec

:fire: Use pre-trained models in PyTorch to extract vector embeddings for any image
