M4t1ss/parallel-corpora-tools
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
This toolkit helps machine translation developers and researchers prepare high-quality text data for training language models. It takes raw parallel (source- and target-language) and monolingual text datasets as input and outputs cleaned, filtered versions of them. Typical users are natural language processing practitioners and machine translation engineers.
No commits in the last 6 months.
Use this if you need to improve the performance of your neural machine translation system by cleaning your training data.
Not ideal if you are looking for advanced data augmentation techniques or comprehensive deep learning model training frameworks.
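To make the input/output description above concrete, here is a minimal sketch of the kind of filtering a parallel-corpus cleaner might perform. It is not taken from this repository (which is written in PHP); the function name `filter_parallel`, the thresholds, and the specific rules (empty lines, identical source and target, overlong sentences, length-ratio outliers, duplicates) are illustrative assumptions.

```python
from typing import Iterable, Iterator, Tuple


def filter_parallel(
    pairs: Iterable[Tuple[str, str]],
    max_ratio: float = 3.0,   # assumed threshold, not from the repository
    max_tokens: int = 100,    # assumed threshold, not from the repository
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs that pass a few simple heuristics."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop untranslated pairs (source copied into target).
        if src == tgt:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop overly long sentences (whitespace tokenization).
        if n_src > max_tokens or n_tgt > max_tokens:
            continue
        # Drop pairs whose lengths differ too much to be plausible translations.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Drop exact duplicate pairs.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        yield src, tgt
```

Called on an iterable of sentence pairs, for example `list(filter_parallel(zip(src_lines, tgt_lines)))`, it returns only the pairs that survive every check. Real toolkits typically add language identification and alignment scoring on top of heuristics like these.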
Stars: 41
Forks: 16
Language: PHP
License: MIT
Category
Last pushed: Dec 19, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/M4t1ss/parallel-corpora-tools"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
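The same request can be sketched in Python using only the standard library. The helper names `build_quality_url` and `fetch_quality` are hypothetical, treating the `nlp` path segment as a category is an assumption based on the URL above, and the response is assumed (not confirmed by this page) to be JSON.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL; `repo` is an "owner/name" string."""
    return f"{BASE_URL}/{quote(category, safe='')}/{quote(repo, safe='/')}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0):
    """Fetch and decode the response, assuming the body is JSON."""
    url = build_quality_url(category, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "M4t1ss/parallel-corpora-tools")` issues the same GET request as the curl command. Unauthenticated calls count against the 100 requests/day limit.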
Higher-rated alternatives
Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
natasha/corus
Links to Russian corpora + Python functions for loading and parsing
darija-open-dataset/dataset
darija <-> english dataset
omicsNLP/Auto-CORPus
Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...
SergeyShk/ruTS
Library for extracting statistics from Russian-language texts.