fsxfreak/nlp-augment
A collection of utilities for exploring data augmentation of low-resource parallel corpora.
This tool helps computational linguists and machine translation researchers expand small datasets of translated text for languages where examples are scarce. It takes your existing parallel corpus (text in two languages) and a larger body of monolingual text in your source language. It then generates additional, augmented sentence pairs to improve the training of machine translation models.
No commits in the last 6 months.
Use this if you are working with low-resource languages and need to enhance your parallel training data to improve machine translation quality.
Not ideal if you have abundant parallel data for your language pair or are not focused on improving machine translation performance for low-resource languages.
Stars: 11
Forks: 3
Language: Python
License: MIT
Category:
Last pushed: Sep 06, 2017
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/fsxfreak/nlp-augment"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
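For scripted access, the same request can be made from Python with the standard library. This is a minimal sketch, not official client code: the helper names `build_url` and `fetch_quality` are hypothetical, the idea that only the trailing owner/repo segments vary is inferred from the single URL shown here, and the response is assumed (not confirmed) to be JSON.

```python
import json
import urllib.request

# Base path copied from the curl command on this page. Treating the trailing
# "<owner>/<repo>" segments as the only variable part is an assumption.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/embeddings"


def build_url(owner: str, repo: str) -> str:
    """Return the quality endpoint URL for one repository (hypothetical helper)."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """GET the endpoint and parse the body as JSON (JSON format is assumed)."""
    with urllib.request.urlopen(build_url(owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the URL needs no network access; the request itself is left to the caller.
print(build_url("fsxfreak", "nlp-augment"))
```

Calling `fetch_quality("fsxfreak", "nlp-augment")` performs the actual request. Without a key, each call counts toward the 100 requests/day limit stated above.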
Higher-rated alternatives
MinishLab/model2vec
Fast State-of-the-Art Static Embeddings
AnswerDotAI/ModernBERT
Bringing BERT into modernity via both architecture changes and scaling
tensorflow/hub
A library for transfer learning by reusing parts of TensorFlow models.
Embedding/Chinese-Word-Vectors
100+ pre-trained Chinese word vectors
twang2218/vocab-coverage
Analysis of language models' Chinese cognitive abilities