sangmichaelxie/doremi

PyTorch implementation of DoReMi, a method for optimizing data mixture weights in language modeling datasets.

Quality score: 42 / 100 (Emerging)

This project helps large language model developers train more efficiently. Given a large text dataset split into content domains (such as web pages, books, or scientific articles), it outputs an optimized recipe for how much of each domain to sample during training, so the final model performs well across many tasks without manually guessing the right data balance.
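
For context on how the method works: the DoReMi paper describes a Group-DRO-style multiplicative update that upweights any domain where a small proxy model's loss exceeds a pretrained reference model's loss. The sketch below is an illustrative reconstruction of that update, not this repository's actual code; the function name, step size, and smoothing constant are made up for the example.

import torch

def update_domain_weights(weights, proxy_losses, ref_losses,
                          step_size=1.0, smoothing=1e-3):
    # Hypothetical helper (not this repo's API): one DoReMi-style
    # reweighting step over per-domain average losses.
    excess = torch.clamp(proxy_losses - ref_losses, min=0.0)  # clipped excess loss per domain
    logits = torch.log(weights) + step_size * excess          # exponentiated-gradient (multiplicative) update
    new_weights = torch.softmax(logits, dim=0)
    # Smooth toward the uniform distribution so no domain's weight collapses to zero.
    uniform = torch.full_like(new_weights, 1.0 / new_weights.numel())
    return (1 - smoothing) * new_weights + smoothing * uniform

# Toy example with three domains (web, books, papers) and made-up losses:
w = torch.tensor([1/3, 1/3, 1/3])
proxy_loss = torch.tensor([2.9, 2.1, 2.5])
ref_loss = torch.tensor([2.6, 2.2, 2.4])
w = update_domain_weights(w, proxy_loss, ref_loss)
print(w)  # mass shifts toward the web domain, where excess loss is largest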

352 stars. No commits in the last 6 months.

Use this if you are a large language model developer struggling to determine the ideal mix of data from various sources to train your models efficiently and achieve broad task performance.

Not ideal if you are looking to train a model on a single, uniform dataset or if you require multi-node training support, as this implementation focuses on single-node, multi-GPU setups.

large-language-models LLM-training dataset-curation natural-language-processing data-mixture-optimization
Stale (6 months) · No package · No dependents

Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 16 / 25


Stars: 352
Forks: 36
Language: HTML
License: MIT
Last pushed: Dec 26, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/sangmichaelxie/doremi"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
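
To fetch the same data from code, here is a minimal Python sketch against the endpoint above (standard library only; the response schema is not documented on this page, so it just pretty-prints whatever JSON comes back):

import json
import urllib.request

# Same endpoint as the curl example; no API key needed up to 100 requests/day.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/sangmichaelxie/doremi"

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# The response schema is undocumented here, so inspect the raw JSON.
print(json.dumps(data, indent=2))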