srush/LLM-Training-Puzzles

What would you do with 1000 H100s...

Overall score: 42 / 100 (Emerging)

This is a collection of 8 challenging puzzles about training large language models (or really any NN) on many, many GPUs. Very few people actually get a chance to train on thousands of computers, but it is an interesting challenge and one that is critically important for modern AI. The goal of these puzzles is to get hands-on experience with the key primitives and to understand the goals of memory efficiency and compute pipelining.
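The puzzles run as notebooks (the repository language is Jupyter Notebook). As a taste of the kind of primitive the description refers to, here is a minimal, hypothetical plain-Python sketch of one classic example: the gradient all-reduce behind data parallelism. The names Device and all_reduce are illustrative, not the repository's API.

from dataclasses import dataclass, field

@dataclass
class Device:
    """One simulated worker holding a local gradient."""
    rank: int
    grad: list[float] = field(default_factory=list)

def all_reduce(devices: list[Device]) -> None:
    """Average gradients across all ranks (hypothetical sketch: done
    centrally here; a real cluster would use a ring or tree collective)."""
    n = len(devices)
    dim = len(devices[0].grad)
    # Reduce: element-wise sum across all ranks.
    total = [sum(d.grad[i] for d in devices) for i in range(dim)]
    # Broadcast: every rank ends up with the same averaged gradient.
    for d in devices:
        d.grad = [t / n for t in total]

devices = [Device(rank=r, grad=[float(r), 2.0 * r]) for r in range(4)]
all_reduce(devices)
print([d.grad for d in devices])  # every rank now holds [1.5, 3.0]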

1,157 stars. No commits in the last 6 months.

Use this if you are a machine learning engineer or researcher looking to deepen your practical understanding of large-scale distributed training for deep neural networks.

Not ideal if you are looking for a tool to train models on a single GPU or a small cluster without focusing on extreme memory and compute optimization challenges.

distributed-training large-language-models deep-learning-optimization gpu-programming ai-infrastructure
Flags: Stale (6 months) · No Package · No Dependents
Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 16 / 25

How are scores calculated? Each dimension is scored out of 25, and the four dimension scores sum to the overall rating: 0 + 10 + 16 + 16 = 42 / 100.

Stars: 1,157
Forks: 72
Language: Jupyter Notebook
License: MIT
Last pushed: Jan 10, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/srush/LLM-Training-Puzzles"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
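For scripted access, a minimal stdlib-only Python sketch equivalent to the curl command above (the endpoint URL is taken from this page; the example simply prints the raw response body rather than assuming a schema):

import urllib.request

# Same endpoint as the curl example above.
URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "llm-tools/srush/LLM-Training-Puzzles")

with urllib.request.urlopen(URL) as resp:
    # Print the raw body; the response schema is not documented here.
    print(resp.read().decode("utf-8"))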