davidschulte/hf-dataset-selector
Find the best datasets for intermediate fine-tuning
When you're fine-tuning a language model for a specific text task but lack enough training data, this tool helps you find additional, relevant datasets. You provide your target dataset and a base language model, and it outputs a list of publicly available Hugging Face datasets, ranked by how likely each is to improve your model's performance when used for an intermediate fine-tuning step. It is aimed at machine learning engineers and researchers working on natural language processing.
No commits in the last 6 months.
Use this if you have a specific text classification or generation task, your own training data is limited, and you need additional, related datasets to improve your language model's performance.
Not ideal if you already have ample training data for your specific task, or if your primary goal is to train a language model from scratch without leveraging existing pre-trained models or external datasets.
Stars: 9
Forks: 2
Language: Jupyter Notebook
License: Apache-2.0
Category:
Last pushed: May 04, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/davidschulte/hf-dataset-selector"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
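The same request can be made from a script. The following is a minimal Python sketch, not an official client: the helper names `build_quality_url` and `fetch_quality` are hypothetical, the `nlp` path segment is assumed to be a category slug, and the response is assumed to be UTF-8 JSON, which this page does not confirm. How an API key is passed is not documented here, so the sketch only covers the keyless case.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL for one repository given as 'owner/name'.

    'category' is an assumed name for the 'nlp' path segment.
    """
    # Keep the slash between owner and name; escape any other unsafe characters.
    return f"{BASE_URL}/{quote(category, safe='')}/{quote(repo, safe='/')}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0):
    """Fetch the record for a repository.

    Assumption: the endpoint returns a UTF-8 JSON body. If it does not,
    json.loads raises a ValueError.
    """
    request = urllib.request.Request(
        build_quality_url(category, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return json.loads(body)


# Example call (performs a network request and counts against the daily limit):
# record = fetch_quality("nlp", "davidschulte/hf-dataset-selector")
# print(json.dumps(record, indent=2))
```

Because the keyless tier is limited to 100 requests per day, it may be worth caching the returned record locally rather than fetching it on every run.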
Higher-rated alternatives
coetaur0/ESIM
Implementation of the ESIM model for natural language inference with PyTorch
erickrf/multiffn-nli
Implementation of the multi feed-forward network architecture by Parikh et al. (2016) for...
vanzytay/EMNLP2018_NLI
Repository for NLI models (EMNLP 2018)
hsinyuan-huang/FusionNet-NLI
An example for applying FusionNet to Natural Language Inference
sdnr1/EBIM-NLI
Enhanced BiLSTM Inference Model for Natural Language Inference