M4t1ss/parallel-corpora-tools

Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.

/ 100

Emerging

This toolkit helps machine translation developers and researchers prepare higher-quality text data for training translation models. It takes raw parallel (source and target language) and monolingual text datasets as input and outputs cleaner, more relevant versions of them. Typical users are natural language processing practitioners and machine translation engineers.
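As a rough illustration of what cleaning a parallel corpus can involve, the Python sketch below applies a few generic sanity checks to sentence pairs: it drops empty lines, untranslated (identical) pairs, overly long sentences, pairs with an extreme length ratio, and exact duplicates. The function name `clean_parallel`, its thresholds, and the choice of rules are assumptions made for illustration; they are not taken from this repository, which is written in PHP and may filter differently.

```python
def clean_parallel(pairs, max_ratio=3.0, max_tokens=100):
    """Return the sentence pairs that pass a few generic sanity checks.

    `pairs` is an iterable of (source, target) strings. The thresholds
    are hypothetical defaults, not values used by any specific toolkit.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop pairs where the target is just a copy of the source.
        if src == tgt:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop pairs where either side is too long (whitespace tokens).
        if n_src > max_tokens or n_tgt > max_tokens:
            continue
        # Drop pairs whose lengths differ too much to be a likely translation.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Drop exact duplicates, keeping the first occurrence.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept
```

In practice such a function would be called on pairs read line by line from two aligned files, for example `clean_parallel(zip(source_lines, target_lines))`.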

No commits in the last 6 months.

Use this if you need to improve the performance of your neural machine translation system by cleaning your training data.

Not ideal if you are looking for advanced data augmentation techniques or comprehensive deep learning model training frameworks.

machine-translation natural-language-processing data-preparation corpus-linguistics neural-networks-training

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 7 / 25

Maturity 16 / 25

Community 18 / 25

Language: PHP

License: MIT

Higher-rated alternatives

Helsinki-NLP/OpusFilter

OpusFilter - Parallel corpus processing toolkit

natasha/corus

Links to Russian corpora + Python functions for loading and parsing

darija-open-dataset/dataset

darija <-> english dataset

omicsNLP/Auto-CORPus

Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...

SergeyShk/ruTS

Library for extracting statistics from Russian-language texts.
