Machine Learning FAQ
Why can model loading require much more memory than expected?
Model loading can require much more memory than expected because the straightforward loading path often keeps two full copies of the weights alive at the same time.
A common pattern is:
- instantiate the model
- load the checkpoint into a state dictionary
- copy those weights into the model
During that process, both the model parameters and the loaded checkpoint data may exist in memory at the same time. For large LLMs, that temporary overlap can be enormous: a model whose weights take 7 GB can briefly need 14 GB or more while both copies exist.
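The pattern above can be sketched in PyTorch (a toy `nn.Linear` and the file name `checkpoint.pt` are hypothetical stand-ins for a real model and checkpoint):

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for a large LLM.
model = nn.Linear(256, 256)

# Save a checkpoint to disk, as a training script would.
torch.save(model.state_dict(), "checkpoint.pt")

# Naive loading path:
# 1) the model's own (randomly initialized) parameters already occupy memory,
# 2) torch.load reads the entire checkpoint into a second set of tensors,
# 3) load_state_dict copies the checkpoint values into the model's parameters.
# Between steps 2 and 3, every weight exists twice.
state_dict = torch.load("checkpoint.pt", weights_only=True)
model.load_state_dict(state_dict)

# The checkpoint tensors stay alive until the reference is dropped.
del state_dict
```

For a toy layer the duplication is invisible; for a multi-gigabyte checkpoint, the window between `torch.load` and `del state_dict` is exactly where the memory spike occurs.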

This is why loading can feel surprisingly expensive even before training or inference starts.
The main reasons are:
- the checkpoint file is read into memory
- the model already exists separately
- weight copies may happen before old buffers are released
- moving weights between CPU and GPU can add more temporary pressure
So “the model is 7 GB” does not mean “loading it only needs 7 GB.” The loading path can require much more than the final steady-state footprint.
That is why the repo’s memory-efficient loading materials recommend more careful approaches such as sequential loading, meta-device initialization, or memory mapping. These techniques are designed to lower the peak memory cost, not just the final memory cost.
In short, model loading can use much more memory than expected because naive loading often materializes both the checkpoint and the target model at once, creating large temporary memory spikes before the final weights settle into place.