Machine Learning FAQ
Why does mixed precision such as `bfloat16` help so much in practice?
Mixed precision helps so much in practice because it attacks two of the biggest LLM bottlenecks at once:
- memory footprint
- compute throughput
When you move much of training from fp32 to a lower-precision format such as bfloat16, tensors take half the space (2 bytes per element instead of 4) and can often use faster hardware execution paths such as tensor cores.
The reason bfloat16 is especially attractive is that it keeps the same 8 exponent bits as fp32, trading mantissa precision instead, so it covers roughly the same range of magnitudes. That makes it more forgiving numerically than plain float16, which has only 5 exponent bits and overflows or underflows much sooner, and it is why bfloat16 became a common practical default for modern training workloads.
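
As a quick illustration (this sketch assumes PyTorch is available; the argument itself does not depend on it), `torch.finfo` makes the range difference concrete: bfloat16 reaches roughly the same maximum magnitude as fp32, while float16 tops out at 65504:

```python
import torch

# Compare the numeric limits of the three formats discussed above.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16s} max={info.max:.3e}  smallest normal={info.tiny:.3e}")
```

On a typical PyTorch build this prints a maximum around 3.4e38 for both float32 and bfloat16, but only 6.6e4 for float16, which is exactly the headroom that makes bfloat16 forgiving.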

The repo’s training-speed measurements show why this matters: switching to bfloat16 roughly halved reserved memory in the example and also gave a large throughput jump.
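
The halving of memory follows directly from the bytes per element. Here is a back-of-the-envelope sketch; the 7B parameter count is an illustrative assumption rather than the repo's example, and it only counts the weights (activations and intermediate tensors shrink in the same 4-to-2-byte proportion):

```python
import torch

n_params = 7_000_000_000  # illustrative model size, not taken from the repo

for dtype in (torch.float32, torch.bfloat16):
    bytes_per_elem = torch.tensor([], dtype=dtype).element_size()  # 4 vs. 2 bytes
    gib = n_params * bytes_per_elem / 1024**3
    print(f"{str(dtype):16s} {bytes_per_elem} bytes/param -> ~{gib:.0f} GiB for the weights alone")
```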
That is the main reason mixed precision is such a big deal in real systems:
- smaller activations
- smaller intermediate tensors
- faster matrix operations on supported hardware
This does not mean everything always runs in low precision with no safeguards. Practical mixed-precision training typically keeps numerically sensitive operations (for example large reductions and normalization layers) as well as the master weights and optimizer state in fp32, and casts down only where it is safe. But the overall effect is still dramatic.
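
A minimal PyTorch sketch of that usual pattern is below; the model, shapes, and optimizer are placeholders, and it assumes a CUDA device with bfloat16 autocast support. The parameters and optimizer state stay in fp32 while the autocast region runs the matrix multiplies in bfloat16:

```python
import torch

# Placeholder model and optimizer: parameters and optimizer state stay in float32.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters())

x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

# Forward pass under autocast: matmuls run in bfloat16 on supported GPUs,
# while numerically sensitive ops and the master weights remain in float32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), target)

loss.backward()      # gradients are float32 because the parameters are float32
optimizer.step()
optimizer.zero_grad()

# With float16 instead of bfloat16 you would typically add a GradScaler to avoid
# gradient underflow; bfloat16's wider exponent range usually makes that unnecessary.
```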
Mixed precision is also one of the first optimizations people reach for because it usually gives a large payoff without requiring a redesign of the model itself.
In short, mixed precision such as bfloat16 helps so much because it reduces memory use and enables faster hardware execution, while bfloat16 remains stable enough to be practical for large-scale LLM training.