sileod/tasksource

Datasets collection and preprocessings framework for NLP extreme multitask learning

/ 100

Emerging

This project helps machine learning engineers and researchers easily access and prepare a vast collection of NLP datasets for advanced model training. It takes raw text datasets and standardizes them into consistent formats (like multiple choice or classification tasks), making them instantly interchangeable. The ideal user is someone building or evaluating large language models who needs a wide range of consistently preprocessed data.
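As a rough illustration of what standardizing datasets into a consistent multiple-choice format can mean, the sketch below maps two differently shaped records onto one shared schema. It does not use the tasksource API: `MultipleChoiceExample`, `standardize`, and all field names and records are hypothetical, invented for this example only.

```python
from dataclasses import dataclass


@dataclass
class MultipleChoiceExample:
    """Hypothetical unified schema: one prompt, several options, one gold index."""
    prompt: str
    choices: list[str]
    label: int


def standardize(record: dict, prompt_field: str,
                choice_fields: list[str], label_field: str) -> MultipleChoiceExample:
    """Map one raw record onto the shared schema using dataset-specific field names."""
    return MultipleChoiceExample(
        prompt=record[prompt_field],
        choices=[record[name] for name in choice_fields],
        label=int(record[label_field]),  # labels may arrive as int or str
    )


# Two made-up raw records with different column layouts (not real datasets).
raw_a = {"question": "2 + 2 = ?", "opt1": "3", "opt2": "4", "answer": 1}
raw_b = {"context": "The sky is", "ending0": "blue", "ending1": "loud", "gold": "0"}

example_a = standardize(raw_a, "question", ["opt1", "opt2"], "answer")
example_b = standardize(raw_b, "context", ["ending0", "ending1"], "gold")
```

Once both records share one schema, the same training or evaluation loop can consume either of them without dataset-specific branching, which is the sense in which standardized datasets become interchangeable.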

193 stars. Used by 1 other package. No commits in the last 6 months. Available on PyPI.

Use this if you need a standardized, large collection of NLP datasets ready for immediate use in multi-task learning, fine-tuning, or evaluating advanced text models.

Not ideal if you are a casual user looking for a simple, single dataset for a basic NLP task or if you lack disk space for large datasets.

natural-language-processing machine-learning-engineering text-classification multi-task-learning model-evaluation

Stale 6m

Maintenance 2 / 25

Adoption 11 / 25

Maturity 25 / 25

Community 10 / 25

Stars

193

Forks

Language

Python

License

CC-BY-4.0

Higher-rated alternatives

luheng/deep_srl

Code and pre-trained model for: Deep Semantic Role Labeling: What Works and What's Next

loomchild/maligna

Bilingual sentence aligner

CK-Explorer/DuoSubs

Semantic subtitle aligner and merger for bilingual subtitle syncing.

coastalcph/lex-glue

LexGLUE: A Benchmark Dataset for Legal Language Understanding in English

PhilipMay/stsb-multi-mt

Machine translated multilingual STS benchmark dataset.
