davidschulte/hf-dataset-selector

Find the best datasets for intermediate fine-tuning

Score: 36 / 100 (Emerging)

When you're building a language model for a specific text task but lack enough training data, this tool helps you find additional, relevant datasets. You provide your target dataset and a base language model, and it outputs a ranked list of publicly available datasets from Hugging Face that are most likely to improve your model's performance through an intermediate fine-tuning step. This is for machine learning engineers or researchers working on natural language processing.

No commits in the last 6 months.

Use this if you have a specific text classification or generation task, only limited training data of your own, and want to find related datasets that can boost your language model's performance.

Not ideal if you already have ample training data for your specific task, or if your primary goal is to train a language model from scratch without leveraging existing pre-trained models or external datasets.

Natural Language Processing Machine Learning Model Training Text Classification Dataset Curation
Stale 6m, No Package, No Dependents
Maintenance 2 / 25
Adoption 5 / 25
Maturity 16 / 25
Community 13 / 25
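
The four category scores listed here add up to the overall 36 / 100. The short sketch below checks that arithmetic. It assumes the overall score is the plain sum of the four categories, which is consistent with the numbers on this page but is not stated by it.

```python
# Sub-scores as listed on this page, each out of 25.
sub_scores = {
    "Maintenance": 2,
    "Adoption": 5,
    "Maturity": 16,
    "Community": 13,
}

# Assumption: the overall score is the unweighted sum of the four categories.
overall = sum(sub_scores.values())
max_overall = 25 * len(sub_scores)

print(f"{overall} / {max_overall}")  # prints "36 / 100"
```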


Stars: 9
Forks: 2
Language: Jupyter Notebook
License: Apache-2.0
Last pushed: May 04, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/davidschulte/hf-dataset-selector"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
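
For scripted access, the same request can be made from Python with only the standard library. This is a minimal sketch, not an official client. The helper names `build_quality_url` and `fetch_quality` are hypothetical, and the URL pattern (category, then `owner/name`) is inferred from the single curl example above. The response format is not documented on this page, so the body is returned as raw text.

```python
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL for a repository given as 'owner/name'.

    Assumption: other repositories follow the same /<category>/<owner>/<name>
    pattern as the one example shown on this page.
    """
    # Keep the slash between owner and name; escape everything else.
    return f"{BASE_URL}/{quote(category, safe='')}/{quote(repo, safe='/')}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the endpoint and return the raw response body as text."""
    url = build_quality_url(category, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only prints the URL; call fetch_quality(...) to make the actual request.
    print(build_quality_url("nlp", "davidschulte/hf-dataset-selector"))
```

Because keyless access is limited to 100 requests/day, it is worth caching the result rather than calling `fetch_quality` repeatedly.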