alon-albalak/data-selection-survey
A Survey on Data Selection for Language Models
This is a curated collection of research papers on how best to select data for training language models. It organizes academic work by stage, from initial filtering through fine-tuning, so that researchers and practitioners can compare strategies for preparing data. The resource is aimed at AI researchers and machine learning engineers who develop or improve large language models.
255 stars. No commits in the last 6 months.
Use this if you are developing large language models and need to understand current best practices and research in data selection for optimal model performance.
Not ideal if you are looking for an off-the-shelf software tool, or for a step-by-step guide to implementing data selection for machine learning tasks outside language modeling.
Stars: 255
Forks: 15
Language: —
License: CC0-1.0
Category:
Last pushed: Apr 29, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/alon-albalak/data-selection-survey"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
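For scripted access, the curl request above could be wrapped roughly as follows. This is a minimal sketch using only the Python standard library, not an official client. The helper names `build_quality_url` and `fetch_quality` are hypothetical, the reading of `nlp` as a category segment is inferred from the URL shape, and the response is assumed to be JSON, which the listing does not confirm.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example; everything after it is
# category/owner/repo (the "category" reading of "nlp" is an assumption).
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(owner: str, repo: str, category: str = "nlp") -> str:
    """Build the endpoint URL for one repository (hypothetical helper)."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(owner: str, repo: str, category: str = "nlp", timeout: float = 10.0):
    """GET the endpoint and decode the body, assuming it is JSON."""
    url = build_quality_url(owner, repo, category)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the URL; call fetch_quality(...) to hit the network.
    print(build_quality_url("alon-albalak", "data-selection-survey"))
```

Keeping URL construction separate from the network call makes the path logic checkable offline, and it leaves room to add retries or an API key later without touching the URL code. How a key is passed is not described in this listing, so it is left out.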