huggingface/datablations
Scaling Data-Constrained Language Models
This project helps machine learning researchers and practitioners train large language models efficiently when data is limited. It provides preprocessed datasets, trained models, and experimental results showing how data repetition, quality filtering, and code augmentation affect model performance and compute usage (a toy illustration of the repetition effect appears below), so users can tune their training strategies to the data constraints they face.
342 stars. No commits in the last 6 months.
Use this if you are a researcher or practitioner focused on training large language models and need to understand how to maximize performance with limited or imperfect data resources.
Not ideal if you are looking for a plug-and-play solution for general language model fine-tuning without deep involvement in data preprocessing or model scaling research.
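To make the data-repetition point concrete, here is a minimal Python sketch (not the repository's code) of the diminishing-returns idea it studies: tokens seen again in later epochs count for less than fresh tokens, and their marginal value saturates. The saturating form mirrors the effective-data idea from the accompanying paper, but the function effective_tokens and the constant r_star below are illustrative placeholders, not fitted values.

import math

def effective_tokens(unique_tokens: float, total_tokens: float, r_star: float = 15.0) -> float:
    # Effective token count under repetition: fresh tokens count fully,
    # repeated epochs add progressively less, saturating as repeats grow.
    # r_star controls how quickly repetition stops helping; the paper fits
    # this constant empirically, so the default here is just a placeholder.
    repeats = total_tokens / unique_tokens - 1.0  # epochs beyond the first
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

# Example: 4 epochs over 100B unique tokens yield ~372B "effective" tokens
# under this placeholder constant, not the 400B a naive count would suggest.
print(effective_tokens(100e9, 400e9) / 1e9)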
Stars: 342
Forks: 18
Language: Jupyter Notebook
License: Apache-2.0
Category: transformers
Last pushed: Jun 28, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/huggingface/datablations"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
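The same data can be fetched from any language; below is a minimal Python sketch of the curl call above, assuming the endpoint returns a JSON body. The response schema is not documented on this page, so the sketch simply pretty-prints whatever comes back.

import json
import urllib.request

# Keyless access allows 100 requests/day; a free key raises this to 1,000/day.
# How the key is passed (header vs. query parameter) is not shown on this page.
URL = ("https://pt-edge.onrender.com"
       "/api/v1/quality/transformers/huggingface/datablations")

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)  # assumes a JSON response body

print(json.dumps(data, indent=2))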
Higher-rated alternatives
jncraton/languagemodels
Explore large language models in 512MB of RAM
microsoft/unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
haizelabs/verdict
Inference-time scaling for LLMs-as-a-judge.
albertan017/LLM4Decompile
Reverse Engineering: Decompiling Binary Code with Large Language Models
bytedance/Sa2VA
Official Repo For Pixel-LLM Codebase