DOLMA-NLP/PARME

Parallel corpora for Middle Eastern languages - ACL2025

/ 100

Experimental

This project offers translation datasets for eight under-resourced Middle Eastern languages, including Luri Bakhtiari, Gilaki, and Zazaki. It provides parallel sentences in formats such as TSV and JSONL, with one column in English, one in Farsi, and one in the target Middle Eastern language. Language researchers, machine translation developers, and linguists working on these languages can use it to build or improve translation models.
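As a rough illustration of the three-column layout described above, the following sketch reads TSV and JSONL records and extracts source/target pairs. The column names (`english`, `farsi`, `target`), the headerless TSV layout, and the placeholder sentences are assumptions made for this example, not taken from the repository; check the actual files for their real headers and keys.

```python
import csv
import io
import json

# Hypothetical column names; the real headers in the dataset may differ.
COLUMNS = ("english", "farsi", "target")


def read_tsv(stream, columns=COLUMNS):
    """Yield one dict per row from a headerless, tab-separated stream."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(columns):
            continue  # skip rows that do not have exactly three fields
        yield dict(zip(columns, row))


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_pairs(records, source="english", target="target"):
    """Return (source, target) tuples, dropping records with empty fields."""
    return [
        (r[source], r[target])
        for r in records
        if r.get(source) and r.get(target)
    ]


# Placeholder data standing in for real files (not actual corpus content).
sample_tsv = "Hello.\t<farsi sentence>\t<target sentence>\n"
sample_jsonl = (
    '{"english": "Hello.", "farsi": "<farsi sentence>", '
    '"target": "<target sentence>"}\n'
)

tsv_records = list(read_tsv(io.StringIO(sample_tsv)))
jsonl_records = list(read_jsonl(io.StringIO(sample_jsonl)))
pairs = to_pairs(tsv_records)
```

To read a real file, pass an open file handle instead of `io.StringIO`, for example `open(path, encoding="utf-8", newline="")`. Choosing `source="farsi"` in `to_pairs` would give Farsi-to-target pairs instead of English-to-target pairs.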

No commits in the last 6 months.

Use this if you need high-quality, human-translated parallel sentence pairs to develop or enhance machine translation systems for specific Middle Eastern languages.

Not ideal if you are looking for general-purpose, off-the-shelf translation software, or if your target language is not one of the eight specific Middle Eastern languages covered.

machine-translation low-resource-languages linguistic-research middle-eastern-languages parallel-corpora

Stale 6m No Package No Dependents

Maintenance 2 / 25

Adoption 4 / 25

Maturity 16 / 25

Community 0 / 25

Language

Python

License

MIT

Higher-rated alternatives

Helsinki-NLP/OpusFilter

OpusFilter - Parallel corpus processing toolkit

natasha/corus

Links to Russian corpora + Python functions for loading and parsing

SergeyShk/ruTS

Library for extracting statistics from Russian-language texts.

darija-open-dataset/dataset

darija <-> english dataset

omicsNLP/Auto-CORPus

Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...
