sileod/tasksource
Dataset collection and preprocessing framework for extreme multitask learning in NLP
This project helps machine learning engineers and researchers access and prepare a large collection of NLP datasets for model training. It takes raw text datasets and standardizes them into consistent formats (such as multiple-choice or classification tasks), so that datasets of the same task type can be used interchangeably. The ideal user is someone building or evaluating large language models who needs a wide range of consistently preprocessed data.
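The standardization idea described above can be pictured with a small, self-contained sketch. It does not use the tasksource API. The schema `MultipleChoiceExample`, the converter functions, and the raw column names are all hypothetical, chosen only to show how two differently shaped raw datasets might be mapped onto one shared multiple-choice format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultipleChoiceExample:
    """Hypothetical shared schema for multiple-choice tasks (not tasksource's own)."""
    prompt: str
    choices: List[str]
    label: int  # index of the correct choice


def from_qa_row(row: dict) -> MultipleChoiceExample:
    # Hypothetical raw layout A: the answer is stored as a letter ("A", "B", ...).
    return MultipleChoiceExample(
        prompt=row["question"],
        choices=list(row["options"]),
        label="ABCDEFGH".index(row["answer"]),
    )


def from_completion_row(row: dict) -> MultipleChoiceExample:
    # Hypothetical raw layout B: endings sit in separate columns, label is an int.
    return MultipleChoiceExample(
        prompt=row["context"],
        choices=[row["ending0"], row["ending1"]],
        label=int(row["gold"]),
    )


# Two toy "datasets" with incompatible column layouts.
qa_rows = [{"question": "2 + 2 = ?", "options": ["3", "4"], "answer": "B"}]
completion_rows = [
    {
        "context": "The ice melted because",
        "ending0": "it was warm.",
        "ending1": "it was cold.",
        "gold": 0,
    }
]

# After conversion, both sources share one schema and can be mixed freely.
standardized = [from_qa_row(r) for r in qa_rows]
standardized += [from_completion_row(r) for r in completion_rows]

for ex in standardized:
    print(ex.prompt, "->", ex.choices[ex.label])
```

Once every source is expressed in the same schema, downstream training or evaluation code only needs to handle that one schema per task type, which is the practical meaning of "interchangeable" here.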
193 stars. Used by 1 other package. No commits in the last 6 months. Available on PyPI.
Use this if you need a standardized, large collection of NLP datasets ready for immediate use in multi-task learning, fine-tuning, or evaluating advanced text models.
Not ideal if you are a casual user looking for a simple, single dataset for a basic NLP task or if you lack disk space for large datasets.
Stars: 193
Forks: 11
Language: Python
License: CC-BY-4.0
Category:
Last pushed: Jul 09, 2025
Commits (30d): 0
Dependencies: 9
Reverse dependents: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/sileod/tasksource"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
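For callers who prefer Python to curl, the same request might be sketched as follows, using only the standard library. The path layout (`domain/owner/repo`) is inferred from the single URL shown above, and the JSON response format is an assumption, since this page does not document either. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(domain: str, owner: str, repo: str) -> str:
    # Assumption: the path is <domain>/<owner>/<repo>, inferred from one example URL.
    parts = [quote(p, safe="") for p in (domain, owner, repo)]
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(domain: str, owner: str, repo: str, timeout: float = 10.0):
    # Assumption: the response body is JSON; the format is not documented here.
    url = quality_url(domain, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the URL needs no network access.
print(quality_url("nlp", "sileod", "tasksource"))

# Uncomment to perform the actual request (counts against the daily limit):
# print(json.dumps(fetch_quality("nlp", "sileod", "tasksource"), indent=2))
```

Keeping URL construction separate from the network call makes it easy to check the request target without spending any of the 100 keyless requests per day.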
Higher-rated alternatives
luheng/deep_srl
Code and pre-trained model for: Deep Semantic Role Labeling: What Works and What's Next
loomchild/maligna
Bilingual sentence aligner
CK-Explorer/DuoSubs
Semantic subtitle aligner and merger for bilingual subtitle syncing.
coastalcph/lex-glue
LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
PhilipMay/stsb-multi-mt
Machine translated multilingual STS benchmark dataset.