sangmichaelxie/doremi
PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
This project helps large language model developers train more efficiently. It takes a large text dataset partitioned into content domains (such as web pages, books, or scientific articles) and outputs an optimized recipe for how much of each domain to sample during training. The result is a language model that performs well across many tasks, without manually guessing the right balance of data.
352 stars. No commits in the last 6 months.
Use this if you are a large language model developer struggling to determine the ideal mix of data from various sources to train your models efficiently and achieve broad task performance.
Not ideal if you are looking to train a model on a single, uniform dataset or if you require multi-node training support, as this implementation focuses on single-node, multi-GPU setups.
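At its core, DoReMi reweights domains by how much a small proxy model underperforms a reference model on each one: domains with higher excess loss get upweighted via a multiplicative-weights update, then the weights are smoothed toward uniform. The sketch below is a simplified, NumPy-only illustration of that update rule, not the repo's actual training loop; the function name and default hyperparameters are illustrative assumptions.

```python
import numpy as np

def update_domain_weights(weights, proxy_losses, ref_losses,
                          step_size=1.0, smoothing=1e-3):
    """One multiplicative-weights step on domain mixture weights.

    A domain's weight grows with its excess loss (proxy loss minus
    reference loss, clipped at zero), so domains where the proxy model
    lags the reference most get upweighted.

    Note: a hypothetical simplified sketch of the DoReMi-style update,
    not the repository's implementation.
    """
    excess = np.maximum(np.asarray(proxy_losses) - np.asarray(ref_losses), 0.0)
    logits = np.log(weights) + step_size * excess
    new = np.exp(logits - logits.max())   # softmax, numerically stable
    new /= new.sum()
    # Mix with the uniform distribution for stability.
    k = len(new)
    return (1.0 - smoothing) * new + smoothing / k
```

For example, starting from uniform weights over three domains, a domain whose proxy loss exceeds its reference loss is the only one upweighted, while the weights still sum to one.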
Stars
352
Forks
36
Language
HTML
License
MIT
Category
Last pushed
Dec 26, 2023
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/sangmichaelxie/doremi"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
scaleapi/llm-engine
Scale LLM Engine public repository
AGI-Arena/MARS
The official implementation of MARS: Unleashing the Power of Variance Reduction for Training Large Models
modelscope/easydistill
a toolkit on knowledge distillation for large language models
AGI-Edgerunners/LLM-Adapters
Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient...
Wang-ML-Lab/bayesian-peft
Bayesian Low-Rank Adaptation of LLMs: BLoB [NeurIPS 2024] and TFB [NeurIPS 2025]