Model loading can require much more memory than expected because the obvious loading path often creates temporary duplication.

A common pattern, shown in the sketch after this list, is:

  1. instantiate the model
  2. load the checkpoint into a state dictionary
  3. copy those weights into the model
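
As a concrete illustration, the naive pattern looks roughly like this in PyTorch (the model here is a small stand-in and the checkpoint path is a placeholder):

    import torch
    import torch.nn as nn

    # Step 1: instantiating the model allocates a full set of
    # (randomly initialized) weights.
    model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))

    # Step 2: torch.load reads the entire checkpoint into memory as a
    # second, independent copy of every weight tensor.
    state_dict = torch.load("checkpoint.pth")  # placeholder path

    # Step 3: load_state_dict copies the checkpoint tensors into the
    # model; until state_dict is released, two full copies of the
    # weights coexist in memory.
    model.load_state_dict(state_dict)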

During that process, both the freshly initialized model parameters and the loaded checkpoint tensors exist in memory at the same time. For a large LLM, that temporary overlap can roughly double the peak memory requirement.

The repo’s memory-efficient loading material focuses on exactly this problem: the naive loading path can create large, avoidable peak-memory spikes.

This is why loading can feel surprisingly expensive even before training or inference starts.

The main reasons are:

  • the checkpoint file is read into memory
  • the model already exists separately
  • weight copies may happen before old buffers are released
  • moving weights between CPU and GPU can add more temporary pressure

So “the model is 7 GB” does not mean “loading it only needs 7 GB.” The loading path can require much more than the final steady-state footprint.
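
One way to see that gap directly is to compare PyTorch's peak and current GPU allocation counters around the loading code; a minimal sketch, assuming a CUDA device and a placeholder checkpoint path:

    import torch
    import torch.nn as nn

    device = torch.device("cuda")
    torch.cuda.reset_peak_memory_stats(device)

    # Naive load: model and checkpoint both live on the GPU for a while.
    model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).to(device)
    state_dict = torch.load("checkpoint.pth", map_location=device)  # placeholder path
    model.load_state_dict(state_dict)
    del state_dict

    steady = torch.cuda.memory_allocated(device) / 1024**3
    peak = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"steady-state: {steady:.2f} GB, peak during loading: {peak:.2f} GB")

On this path the peak reading comes out noticeably higher than the steady-state one, because the state dictionary temporarily duplicates every weight.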

That is why the repo’s memory-efficient loading materials recommend more careful approaches such as sequential loading, meta-device initialization, or memory mapping. These techniques are designed to lower the peak memory cost, not just the final memory cost.
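
A sketch of what meta-device initialization and memory mapping can look like together, assuming a PyTorch version that supports mmap=True in torch.load and assign=True in load_state_dict (roughly 2.1 and later); the checkpoint path is again a placeholder:

    import torch
    import torch.nn as nn

    # Meta-device initialization: build the model's structure without
    # allocating real storage for any of its weights.
    with torch.device("meta"):
        model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

    # Memory mapping: the checkpoint stays on disk, and pages are
    # pulled in on demand instead of being read into RAM all at once.
    state_dict = torch.load("checkpoint.pth", map_location="cpu", mmap=True)

    # assign=True re-points the model's parameters at the loaded
    # tensors instead of copying, so only one materialized copy of the
    # weights ever exists.
    model.load_state_dict(state_dict, assign=True)

Sequential loading takes a different route to the same goal: it copies the checkpoint into the model one tensor at a time, releasing each source tensor as it goes, so at most one extra weight tensor is in flight at any moment.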

In short, model loading can use much more memory than expected because naive loading often materializes both the checkpoint and the target model at once, creating large temporary memory spikes before the final weights settle into place.