Before moving from one GPU to many, it usually makes sense to exhaust the simpler single-GPU optimizations first. They are easier to debug, cheaper to run, and often deliver surprisingly large gains.

The repo’s training-speed material is a good example: a long sequence of single-GPU improvements dramatically increased throughput before distributed training entered the picture.

The repo includes a direct comparison workflow for inspecting how the optimized single-GPU code differs from the baseline implementation.

Good first optimizations include the following; a short code sketch after the list shows what most of them look like in practice:

  • using bfloat16
  • enabling tensor-core-friendly settings
  • using fused optimizers
  • enabling pinned memory in the data loader
  • replacing from-scratch kernels with optimized PyTorch implementations where appropriate
  • using FlashAttention
  • trying torch.compile
  • tuning batch size after the other optimizations are in place

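To make these concrete, here is a minimal single-GPU sketch covering several items from the list. It assumes a recent PyTorch build with CUDA available; the toy model, random dataset, and hyperparameters are stand-ins for illustration, not code from the repo.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")

# Tensor-core-friendly settings: allow TF32 matmuls on Ampere+ GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Stand-in model and dataset just to make the sketch runnable.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 10)).to(device)
data = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
loss_fn = nn.CrossEntropyLoss()

# torch.compile: JIT-compile the model into fused kernels.
model = torch.compile(model)

# Fused optimizer: the parameter update runs as one fused CUDA kernel.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)

# Pinned host memory enables faster, asynchronous host-to-device copies.
loader = DataLoader(data, batch_size=256, shuffle=True,
                    num_workers=4, pin_memory=True)

for inputs, targets in loader:
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)

    # Mixed precision: run the forward pass in bfloat16 (no GradScaler needed).
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(inputs), targets)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

# For attention layers, torch.nn.functional.scaled_dot_product_attention
# dispatches to FlashAttention-style fused kernels when shapes/dtypes allow,
# which is one way to replace from-scratch kernels with optimized ones.
```

Batch size is deliberately tuned last: once bfloat16, fused kernels, and compilation have reduced per-step memory and time, a larger batch often fits and improves utilization.
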
These single-GPU changes matter because distributed training adds a lot of complexity (see the setup sketch after this list):

  • process-group setup
  • synchronization overhead
  • harder debugging
  • more failure modes

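For comparison, even a minimal DistributedDataParallel setup adds boilerplate the single-GPU loop above never needs. This is a hedged sketch, assuming the script is launched with torchrun (one process per GPU) and reusing the stand-in model and dataset from the previous example:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# Process-group setup: one process per GPU, wired together over NCCL.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# Wrap the model so gradients are all-reduced across ranks each step.
model = DDP(model.to(local_rank), device_ids=[local_rank])

# Shard the dataset so each rank sees a distinct slice per epoch.
sampler = DistributedSampler(data)
loader = DataLoader(data, batch_size=256, sampler=sampler,
                    num_workers=4, pin_memory=True)

# The training loop also grows: sampler.set_epoch(epoch) each epoch,
# rank-0-only logging and checkpointing, and dist.destroy_process_group()
# on exit all come on top of the single-GPU version.
```

Every one of these extra steps is also a new place for things to go wrong, which is exactly the complexity the list above refers to.
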
So if a single GPU is still leaving obvious speed or memory gains on the table, multi-GPU is usually not the first thing to reach for.

The repo’s optimization summary is effectively a single-GPU checklist: several large improvements come before distributed data parallelism is even necessary.

The practical rule is to move to multi-GPU when:

  • the optimized single-GPU path is already reasonably efficient
  • the model or data scale truly exceeds one GPU
  • the added complexity is justified

In short, the good first optimizations before multi-GPU training are mixed precision, efficient kernels, fused optimizers, better data loading, and compilation: they often unlock large gains without any distributed-training complexity.