Model loading can require much more memory than expected because the obvious loading path often creates temporary duplication.

A common pattern, shown in the sketch after this list, is:

  1. instantiate the model
  2. load the checkpoint into a state dictionary
  3. copy those weights into the model
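
As a concrete illustration, the naive pattern looks roughly like this in PyTorch (the model here is a small stand-in and the checkpoint path is a placeholder):

    import torch
    import torch.nn as nn

    # Step 1: instantiating the model allocates a full set of
    # (randomly initialized) weights.
    model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))

    # Step 2: torch.load reads the entire checkpoint into memory as a
    # second, independent copy of every weight tensor.
    state_dict = torch.load("checkpoint.pth")  # placeholder path

    # Step 3: load_state_dict copies the checkpoint tensors into the
    # model; until state_dict is released, two full copies of the
    # weights coexist in memory.
    model.load_state_dict(state_dict)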

During that process, both the freshly initialized model parameters and the loaded checkpoint tensors exist in memory at the same time. For a large LLM, that temporary overlap can roughly double the peak memory requirement.

The repo’s memory-efficient loading material focuses on exactly this problem: the naive loading path can create large, avoidable peak-memory spikes.

This is why loading can feel surprisingly expensive even before training or inference starts.

The main reasons are:

  • the checkpoint file is read into memory
  • the model already exists separately
  • weight copies may happen before old buffers are released
  • moving weights between CPU and GPU can add more temporary pressure

So “the model is 7 GB” does not mean “loading it only needs 7 GB.” The loading path can require much more than the final steady-state footprint.
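
One way to see that gap directly is to compare PyTorch's peak and current GPU allocation counters around the loading code; a minimal sketch, assuming a CUDA device and a placeholder checkpoint path:

    import torch
    import torch.nn as nn

    device = torch.device("cuda")
    torch.cuda.reset_peak_memory_stats(device)

    # Naive load: model and checkpoint both live on the GPU for a while.
    model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).to(device)
    state_dict = torch.load("checkpoint.pth", map_location=device)  # placeholder path
    model.load_state_dict(state_dict)
    del state_dict

    steady = torch.cuda.memory_allocated(device) / 1024**3
    peak = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"steady-state: {steady:.2f} GB, peak during loading: {peak:.2f} GB")

On this path the peak reading comes out noticeably higher than the steady-state one, because the state dictionary temporarily duplicates every weight.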

That is why the repo’s memory-efficient loading materials recommend more careful approaches such as sequential loading, meta-device initialization, or memory mapping. These techniques are designed to lower the peak memory cost, not just the final memory cost.
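
A sketch of what meta-device initialization and memory mapping can look like together, assuming a PyTorch version that supports mmap=True in torch.load and assign=True in load_state_dict (roughly 2.1 and later); the checkpoint path is again a placeholder:

    import torch
    import torch.nn as nn

    # Meta-device initialization: build the model's structure without
    # allocating real storage for any of its weights.
    with torch.device("meta"):
        model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

    # Memory mapping: the checkpoint stays on disk, and pages are
    # pulled in on demand instead of being read into RAM all at once.
    state_dict = torch.load("checkpoint.pth", map_location="cpu", mmap=True)

    # assign=True re-points the model's parameters at the loaded
    # tensors instead of copying, so only one materialized copy of the
    # weights ever exists.
    model.load_state_dict(state_dict, assign=True)

Sequential loading takes a different route to the same goal: it copies the checkpoint into the model one tensor at a time, releasing each source tensor as it goes, so at most one extra weight tensor is in flight at any moment.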

In short, model loading can use much more memory than expected because naive loading often materializes both the checkpoint and the target model at once, creating large temporary memory spikes before the final weights settle into place.