Machine Learning FAQ
Why does mixed precision such as `bfloat16` help so much in practice?
Mixed precision helps so much in practice because it attacks two of the biggest LLM bottlenecks at once:
- memory footprint
- compute throughput
When you move much of training from fp32 to a lower-precision format such as bfloat16, tensors take half the space (2 bytes per element instead of 4) and can often use faster hardware execution paths such as tensor cores.
The reason bfloat16 is especially attractive is that it keeps the same 8 exponent bits as fp32, trading mantissa precision instead, so it covers roughly the same range of magnitudes. That makes it more forgiving numerically than plain float16, which has only 5 exponent bits and overflows or underflows much sooner, and it is why bfloat16 became a common practical default for modern training workloads.
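
As a quick illustration (this sketch assumes PyTorch is available; the argument itself does not depend on it), `torch.finfo` makes the range difference concrete: bfloat16 reaches roughly the same maximum magnitude as fp32, while float16 tops out at 65504:

```python
import torch

# Compare the numeric limits of the three formats discussed above.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16s} max={info.max:.3e}  smallest normal={info.tiny:.3e}")
```

On a typical PyTorch build this prints a maximum around 3.4e38 for both float32 and bfloat16, but only 6.6e4 for float16, which is exactly the headroom that makes bfloat16 forgiving.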

The repo’s training-speed measurements show why this matters: switching to bfloat16 roughly halved reserved memory in the example and also gave a large throughput jump.
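
The halving of memory follows directly from the bytes per element. Here is a back-of-the-envelope sketch; the 7B parameter count is an illustrative assumption rather than the repo's example, and it only counts the weights (activations and intermediate tensors shrink in the same 4-to-2-byte proportion):

```python
import torch

n_params = 7_000_000_000  # illustrative model size, not taken from the repo

for dtype in (torch.float32, torch.bfloat16):
    bytes_per_elem = torch.tensor([], dtype=dtype).element_size()  # 4 vs. 2 bytes
    gib = n_params * bytes_per_elem / 1024**3
    print(f"{str(dtype):16s} {bytes_per_elem} bytes/param -> ~{gib:.0f} GiB for the weights alone")
```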
That is the main reason mixed precision is such a big deal in real systems:
- smaller activations
- smaller intermediate tensors
- faster matrix operations on supported hardware
This does not mean everything always runs in low precision with no safeguards. Practical mixed-precision training typically keeps numerically sensitive operations (for example large reductions and normalization layers) as well as the master weights and optimizer state in fp32, and casts down only where it is safe. But the overall effect is still dramatic.
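
A minimal PyTorch sketch of that usual pattern is below; the model, shapes, and optimizer are placeholders, and it assumes a CUDA device with bfloat16 autocast support. The parameters and optimizer state stay in fp32 while the autocast region runs the matrix multiplies in bfloat16:

```python
import torch

# Placeholder model and optimizer: parameters and optimizer state stay in float32.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters())

x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

# Forward pass under autocast: matmuls run in bfloat16 on supported GPUs,
# while numerically sensitive ops and the master weights remain in float32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), target)

loss.backward()      # gradients are float32 because the parameters are float32
optimizer.step()
optimizer.zero_grad()

# With float16 instead of bfloat16 you would typically add a GradScaler to avoid
# gradient underflow; bfloat16's wider exponent range usually makes that unnecessary.
```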
Mixed precision is also one of the first optimizations people reach for because it usually gives a large payoff without requiring a redesign of the model itself.
In short, mixed precision such as bfloat16 helps so much because it reduces memory use and enables faster hardware execution, while bfloat16 remains stable enough to be practical for large-scale LLM training.