amir9ume/urdu_ghazals_rekhta
Dataset for Urdu Ghazals
This project provides a collection of classical Urdu ghazals, a popular form of South Asian poetry, organized by author and available in Urdu, Hindi, and English transliteration. It is intended as text data for natural language processing work, particularly on Urdu, which is considered a 'low-resource' language. Its most likely users are researchers and students in computational linguistics or digital humanities who focus on South Asian languages.
No commits in the last 6 months.
Use this if you are a researcher or student in computational linguistics looking for a structured dataset of Urdu ghazals to analyze or to use in experiments with language models.
Not ideal if you are trying to train a large-scale transformer model from scratch, as the dataset is relatively small for that purpose.
Stars: 20
Forks: 9
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Aug 14, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/amir9ume/urdu_ghazals_rekhta"
Open to everyone: 100 requests/day with no key needed, or 1,000 requests/day with a free key.
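The curl request above can also be issued from Python. The following is a minimal sketch using only the standard library. Two things in it are assumptions rather than documented behavior: the helper names `build_quality_url` and `fetch_quality` are hypothetical, and the path layout (category, then owner, then repository) is inferred from the single example URL. The response is also assumed to be JSON, which this page does not confirm.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; the segments appended to it
# (category/owner/repo) are inferred from that one URL, not from API docs.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the request URL, percent-encoding each path segment."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record for one repository.

    Assumption: the endpoint returns a JSON body. If it does not,
    json.loads raises a ValueError and the raw text should be inspected.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return json.loads(body)


if __name__ == "__main__":
    # Prints the same URL used in the curl example; no network call is made here.
    print(build_quality_url("nlp", "amir9ume", "urdu_ghazals_rekhta"))
```

Calling `fetch_quality("nlp", "amir9ume", "urdu_ghazals_rekhta")` performs the actual request. With the keyless limit of 100 requests/day, it is worth caching the result locally rather than refetching it on every run.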
Higher-rated alternatives
acl-org/acl-anthology
Data and software for building the ACL Anthology.
anoopkunchukuttan/indic_nlp_library
Resources and tools for Indian language Natural Language Processing
CLUEbenchmark/CLUECorpus2020
Large-scale pre-training corpus for Chinese (100G)
KennethEnevoldsen/scandinavian-embedding-benchmark
A Scandinavian Benchmark for sentence embeddings
Separius/awesome-sentence-embedding
A curated list of pretrained sentence and word embedding models