sangmichaelxie/doremi

PyTorch implementation of DoReMi, a method for optimizing data mixture weights in language modeling datasets.

Quality score: 42 / 100 (Emerging)

This project helps large language model developers train more efficiently. Given a large text dataset split into content domains (such as web pages, books, or scientific articles), it outputs an optimized recipe for how much of each domain to sample during training, so the final model performs well across many tasks without manually guessing the right data balance.
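
For context on how the method works: the DoReMi paper describes a Group-DRO-style multiplicative update that upweights any domain where a small proxy model's loss exceeds a pretrained reference model's loss. The sketch below is an illustrative reconstruction of that update, not this repository's actual code; the function name, step size, and smoothing constant are made up for the example.

import torch

def update_domain_weights(weights, proxy_losses, ref_losses,
                          step_size=1.0, smoothing=1e-3):
    # Hypothetical helper (not this repo's API): one DoReMi-style
    # reweighting step over per-domain average losses.
    excess = torch.clamp(proxy_losses - ref_losses, min=0.0)  # clipped excess loss per domain
    logits = torch.log(weights) + step_size * excess          # exponentiated-gradient (multiplicative) update
    new_weights = torch.softmax(logits, dim=0)
    # Smooth toward the uniform distribution so no domain's weight collapses to zero.
    uniform = torch.full_like(new_weights, 1.0 / new_weights.numel())
    return (1 - smoothing) * new_weights + smoothing * uniform

# Toy example with three domains (web, books, papers) and made-up losses:
w = torch.tensor([1/3, 1/3, 1/3])
proxy_loss = torch.tensor([2.9, 2.1, 2.5])
ref_loss = torch.tensor([2.6, 2.2, 2.4])
w = update_domain_weights(w, proxy_loss, ref_loss)
print(w)  # mass shifts toward the web domain, where excess loss is largest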

352 stars. No commits in the last 6 months.

Use this if you are a large language model developer struggling to determine the ideal mix of data from various sources to train your models efficiently and achieve broad task performance.

Not ideal if you are looking to train a model on a single, uniform dataset or if you require multi-node training support, as this implementation focuses on single-node, multi-GPU setups.

large-language-models LLM-training dataset-curation natural-language-processing data-mixture-optimization
Stale (6 months) · No package · No dependents

Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 16 / 25


Stars: 352
Forks: 36
Language: HTML
License: MIT
Last pushed: Dec 26, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/sangmichaelxie/doremi"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
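
To fetch the same data from code, here is a minimal Python sketch against the endpoint above (standard library only; the response schema is not documented on this page, so it just pretty-prints whatever JSON comes back):

import json
import urllib.request

# Same endpoint as the curl example; no API key needed up to 100 requests/day.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/sangmichaelxie/doremi"

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# The response schema is undocumented here, so inspect the raw JSON.
print(json.dumps(data, indent=2))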