Kartikaggarwal98/Indian_ParallelCorpus
Curated list of publicly available parallel corpora for Indian languages
This is a curated list of publicly available parallel text datasets for numerous Indian languages, including Hindi, Bengali, Tamil, Telugu, and more. It helps you find corpora of sentence pairs in which the same text appears in two different languages. Practitioners such as linguists, localization managers, and AI researchers use it to train machine translation models or develop multilingual applications.
No commits in the last 6 months.
Use this if you need to find existing parallel text corpora for training machine translation systems or other cross-lingual natural language processing tasks involving Indian languages.
Not ideal if you are looking for tools to create parallel corpora or if your focus is on monolingual text data.
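Sentence-pair corpora of the kind described above are often distributed as two line-aligned plain-text files, one per language, where line N of each file holds the same sentence. The following is a minimal sketch of reading such a pair of files in Python. Whether any particular corpus in this list uses that layout is an assumption, and the function names and file names are hypothetical.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def pair_lines(src_lines: Iterable[str], tgt_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned sequences.

    Assumes line N of the source corresponds to line N of the target.
    Pairs where either side is blank are dropped.
    """
    for src, tgt in zip_longest(src_lines, tgt_lines, fillvalue=None):
        if src is None or tgt is None:
            # One side ran out first, so the two sides are not line-aligned.
            raise ValueError("source and target have different line counts")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            yield src, tgt


def read_parallel(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Read two UTF-8 text files (hypothetical paths) as aligned sentence pairs."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        yield from pair_lines(src, tgt)


# Hypothetical usage, assuming files named train.hi and train.en exist:
# for hindi, english in read_parallel("train.hi", "train.en"):
#     print(hindi, "|||", english)
```

Raising on a line-count mismatch, rather than silently truncating, makes a misaligned download visible before it reaches model training.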
Stars: 37
Forks: 5
Language: —
License: —
Category:
Last pushed: Jul 15, 2021
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Kartikaggarwal98/Indian_ParallelCorpus"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
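The same request can be made from Python with only the standard library. This is a sketch under stated assumptions: the path layout (category, then owner, then repository) is inferred from the single curl URL above, the response is assumed to be JSON, and the helper names are hypothetical. The response fields are not documented here, so the sketch returns the decoded body as-is.

```python
import json
import urllib.request

# Base path taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL.

    Assumption: the path segments are category/owner/repo, as inferred
    from the one example URL shown on this page.
    """
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for one repository.

    Assumption: the endpoint returns a JSON body; its schema is not
    documented on this page, so the decoded object is returned unchanged.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (performs a network request, counted against the daily limit):
# data = fetch_quality("nlp", "Kartikaggarwal98", "Indian_ParallelCorpus")
# print(data)
```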
Higher-rated alternatives
Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
natasha/corus
Links to Russian corpora + Python functions for loading and parsing
SergeyShk/ruTS
Library for extracting statistics from Russian-language texts.
darija-open-dataset/dataset
Darija <-> English dataset
omicsNLP/Auto-CORPus
Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...