ajithalbus/TamilCorpus
Open Source Tamil Corpus of 58M words
This project provides a corpus of over 58 million Tamil words for research and applications. It collects raw text from sources such as Wikipedia and The Hindu (Tamil) and packages it as a dataset. It is aimed at linguists, researchers, and data scientists building language tools for Tamil.
No commits in the last 6 months.
Use this if you need a substantial amount of Tamil text data to analyze language patterns, train models, or build natural language processing applications.
Not ideal if you need a perfectly clean, pre-processed, or annotated dataset, as some additional cleansing might be required.
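The listing does not say what kind of cleansing the corpus needs, so the following is only a minimal sketch of one possible first pass. It assumes the corpus is available as UTF-8 text lines, and it treats anything outside the Tamil Unicode block (U+0B80 to U+0BFF), ASCII digits, whitespace, and a few punctuation marks as noise. The function names `clean_line` and `count_tamil_words` are hypothetical and are not part of the repository.

```python
import re

# Assumption: "noise" is any character that is not Tamil script (U+0B80-U+0BFF),
# an ASCII digit, whitespace, or one of a few basic punctuation marks.
NON_TAMIL = re.compile(r"[^\u0B80-\u0BFF0-9\s.,!?]")
WHITESPACE = re.compile(r"\s+")
TAMIL_TOKEN = re.compile(r"[\u0B80-\u0BFF]+")


def clean_line(line: str) -> str:
    """Replace non-Tamil noise with spaces, then collapse runs of whitespace."""
    line = NON_TAMIL.sub(" ", line)
    return WHITESPACE.sub(" ", line).strip()


def count_tamil_words(lines) -> int:
    """Count contiguous runs of Tamil-script characters across all lines."""
    return sum(len(TAMIL_TOKEN.findall(clean_line(line))) for line in lines)
```

A call such as `count_tamil_words(open(path, encoding="utf-8"))` would stream a file line by line, where `path` is whatever local file holds the corpus text. Counting runs of Tamil characters rather than whitespace-separated tokens keeps stray punctuation and leftover digits out of the word count.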
Stars: 11
Forks: 2
Language: Shell
License: GPL-3.0
Category:
Last pushed: Jul 31, 2020
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/ajithalbus/TamilCorpus"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
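For scripted use, the same request can be made from Python's standard library. This is a rough sketch under two assumptions that the listing does not confirm: that other repositories follow the same `owner/name` path pattern as the URL above, and that the response body is JSON (the code falls back to returning raw text if it is not). The helper names `build_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path taken from the curl example; reusing it for other repositories
# is an assumption, not documented behaviour.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def build_url(repo: str) -> str:
    """Return the endpoint URL for an 'owner/name' repository slug."""
    owner, _, name = repo.partition("/")
    if not owner or not name:
        raise ValueError(f"expected 'owner/name', got {repo!r}")
    return f"{BASE_URL}/{owner}/{name}"


def fetch_quality(repo: str, timeout: float = 10.0):
    """Fetch the record for a repository; parsed JSON if possible, else text."""
    with urllib.request.urlopen(build_url(repo), timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    try:
        return json.loads(body)  # assumption: the API answers with JSON
    except json.JSONDecodeError:
        return body
```

Calling `fetch_quality("ajithalbus/TamilCorpus")` issues one request, so a keyless client should stay under the 100 requests/day limit by caching results rather than polling.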
Higher-rated alternatives
Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
natasha/corus
Links to Russian corpora + Python functions for loading and parsing
darija-open-dataset/dataset
darija <-> english dataset
omicsNLP/Auto-CORPus
Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...
SergeyShk/ruTS
Library for extracting statistics from Russian-language texts.