M4t1ss/parallel-corpora-tools

Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.

Score 41 / 100 (Emerging)

This toolkit helps machine translation developers and researchers prepare high-quality text data for training translation models. It takes raw parallel (source and target language) and monolingual text datasets as input and outputs cleaner, more relevant versions of those datasets. The typical user is a natural language processing practitioner or machine translation engineer.
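To make the idea of cleaning a parallel corpus concrete, here is a minimal Python sketch of the kind of filtering commonly applied to sentence pairs. It is not taken from this repository (which is written in PHP) and does not reproduce its actual filters; the function name `filter_parallel` and the thresholds `max_ratio` and `max_tokens` are hypothetical choices made for illustration.

```python
from typing import Iterable, Iterator, Tuple


def filter_parallel(
    pairs: Iterable[Tuple[str, str]],
    max_ratio: float = 3.0,   # hypothetical threshold, not from the repository
    max_tokens: int = 100,    # hypothetical threshold, not from the repository
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs that pass a few generic sanity checks."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop pairs where the target is an untranslated copy of the source.
        if src == tgt:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop overly long sentences (whitespace token count).
        if n_src > max_tokens or n_tgt > max_tokens:
            continue
        # Drop pairs whose lengths differ too much to be plausible translations.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Drop exact duplicate pairs, keeping the first occurrence.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        yield src, tgt
```

Because the function is a generator over pairs, it can be fed from two files read in lockstep (for example with `zip`) without loading the whole corpus into memory, apart from the set used for duplicate detection.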

No commits in the last 6 months.

Use this if you need to improve the performance of your neural machine translation system by cleaning your training data.

Not ideal if you are looking for advanced data augmentation techniques or comprehensive deep learning model training frameworks.

machine-translation natural-language-processing data-preparation corpus-linguistics neural-networks-training
Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 7 / 25
Maturity 16 / 25
Community 18 / 25


Stars 41
Forks 16
Language PHP
License MIT
Last pushed Dec 19, 2023
Commits (30d) 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/M4t1ss/parallel-corpora-tools"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
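The same request can be made from a script. The sketch below uses only the Python standard library and the endpoint shown in the curl command above. It assumes the response body is JSON, which this page does not state, and makes no assumption about which fields it contains; the helper names `build_quality_url` and `fetch_quality` are illustrative.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example on this page.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the quality endpoint URL for one repository."""
    parts = (quote(p, safe="") for p in (category, owner, repo))
    return f"{API_BASE}/{'/'.join(parts)}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for a repository.

    Assumption: the endpoint returns a JSON document. The structure of
    that document is not described on this page, so it is returned as-is.
    """
    url = build_quality_url(category, owner, repo)
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "M4t1ss", "parallel-corpora-tools")` issues the same request as the curl command. With the keyless limit of 100 requests per day, it is worth caching results rather than refetching on every run.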