huggingface/datablations
Scaling Data-Constrained Language Models
This project helps machine learning researchers and practitioners train large language models efficiently when data is limited. It provides preprocessed datasets, trained models, and experimental results showing how data repetition, quality filtering, and code augmentation affect model performance and compute usage (a toy illustration of the repetition effect appears below), so users can tune their training strategies to the data constraints they face.
342 stars. No commits in the last 6 months.
Use this if you are a researcher or practitioner focused on training large language models and need to understand how to maximize performance with limited or imperfect data resources.
Not ideal if you are looking for a plug-and-play solution for general language model fine-tuning without deep involvement in data preprocessing or model scaling research.
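To make the data-repetition point concrete, here is a minimal Python sketch (not the repository's code) of the diminishing-returns idea it studies: tokens seen again in later epochs count for less than fresh tokens, and their marginal value saturates. The saturating form mirrors the effective-data idea from the accompanying paper, but the function effective_tokens and the constant r_star below are illustrative placeholders, not fitted values.

import math

def effective_tokens(unique_tokens: float, total_tokens: float, r_star: float = 15.0) -> float:
    # Effective token count under repetition: fresh tokens count fully,
    # repeated epochs add progressively less, saturating as repeats grow.
    # r_star controls how quickly repetition stops helping; the paper fits
    # this constant empirically, so the default here is just a placeholder.
    repeats = total_tokens / unique_tokens - 1.0  # epochs beyond the first
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

# Example: 4 epochs over 100B unique tokens yield ~372B "effective" tokens
# under this placeholder constant, not the 400B a naive count would suggest.
print(effective_tokens(100e9, 400e9) / 1e9)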
Stars: 342
Forks: 18
Language: Jupyter Notebook
License: Apache-2.0
Category: transformers
Last pushed: Jun 28, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/huggingface/datablations"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
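The same data can be fetched from any language; below is a minimal Python sketch of the curl call above, assuming the endpoint returns a JSON body. The response schema is not documented on this page, so the sketch simply pretty-prints whatever comes back.

import json
import urllib.request

# Keyless access allows 100 requests/day; a free key raises this to 1,000/day.
# How the key is passed (header vs. query parameter) is not shown on this page.
URL = ("https://pt-edge.onrender.com"
       "/api/v1/quality/transformers/huggingface/datablations")

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)  # assumes a JSON response body

print(json.dumps(data, indent=2))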
Higher-rated alternatives
jncraton/languagemodels
Explore large language models in 512MB of RAM
microsoft/unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
haizelabs/verdict
Inference-time scaling for LLMs-as-a-judge.
albertan017/LLM4Decompile
Reverse Engineering: Decompiling Binary Code with Large Language Models
bytedance/Sa2VA
Official Repo For Pixel-LLM Codebase