Machine Learning FAQ
What are the most common practical bottlenecks when training or running an LLM on limited hardware?
The most common practical bottlenecks on limited hardware are almost always some combination of memory and throughput.
For training, the biggest memory consumers are usually:
- model weights
- gradients and optimizer states
- activations needed for backpropagation, which scale with batch size and sequence (context) length
For inference, the main memory consumers are usually:
- model weights
- KV cache for long prompts and long outputs
- the numeric precision in which both are stored, such as fp32 versus bf16
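A quick back-of-the-envelope estimate makes the gap between the two cases concrete. The sketch below is illustrative only: the 7B parameter count, bf16 weights, and Adam optimizer are assumptions, not numbers from the repo.

```python
# Rough memory estimate for a dense transformer; all numbers are illustrative.
def estimate_memory_gb(n_params: float, bytes_per_param: int = 2) -> dict:
    gb = 1024 ** 3
    weights = n_params * bytes_per_param        # bf16 weights
    grads = n_params * bytes_per_param          # gradients (training only)
    adam_states = n_params * 4 * 2              # fp32 first and second moments
    return {
        "inference_weights_gb": weights / gb,
        "training_state_gb": (weights + grads + adam_states) / gb,
    }

# A 7B-parameter model in bf16: roughly 13 GB just for the weights at inference,
# but roughly 78 GB of weights + gradients + optimizer states for training,
# before counting activations, which grow with batch size and context length.
print(estimate_memory_gb(7e9))
```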
One useful way to separate the problem is:
Training bottlenecks
- activation memory
- optimizer-state memory
- slow attention and matrix-multiplication kernels
- small or poorly utilized batch sizes
Inference bottlenecks
- loading the model into RAM or VRAM
- KV-cache growth with context length
- sequential token generation latency
The repo includes dedicated material on both sides of this.
For model loading, one practical issue is peak memory during checkpoint loading. A naive weight-loading approach can briefly hold multiple copies of the model or state dict in memory at once, which can make loading fail even when the final model would fit.
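One common mitigation in recent PyTorch releases is to build the model on the meta device and memory-map the checkpoint, so weights are materialized directly into the model instead of being duplicated. A minimal sketch, where MyTransformer, config, and the checkpoint path are placeholders:

```python
import torch

# Placeholder model class and checkpoint path, shown only to illustrate the pattern
# (mmap=True and assign=True require a recent PyTorch 2.x release).
with torch.device("meta"):
    model = MyTransformer(config)  # parameters are shape-only, nothing allocated yet

# mmap=True maps the checkpoint file instead of reading it fully into RAM,
# and map_location="cpu" keeps the GPU out of the loading path.
state_dict = torch.load("model.pth", mmap=True, weights_only=True, map_location="cpu")

# assign=True places the loaded tensors directly into the model instead of
# copying them into separately allocated parameters, which lowers peak memory.
model.load_state_dict(state_dict, assign=True)
model.to("cuda")
```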

For runtime and training throughput, the repo’s PyTorch performance notes highlight several common bottlenecks and fixes:
- high precision instead of bf16
- non-fused optimizers
- inefficient attention implementations
- missing FlashAttention
- missing torch.compile
- poor tensor shapes or batch sizes
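Several of these fixes are only a line or two of PyTorch. The training loop below is a minimal sketch in which MyTransformer, config, and train_loader are placeholders:

```python
import torch
import torch.nn.functional as F

model = MyTransformer(config).to("cuda")   # placeholder model
model = torch.compile(model)               # fuse kernels / optimize the graph

# A fused optimizer avoids launching one small kernel per parameter tensor.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, fused=True)

for inputs, targets in train_loader:       # placeholder dataloader
    inputs, targets = inputs.to("cuda"), targets.to("cuda")
    # bf16 autocast reduces memory traffic and enables faster matmul kernels.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(inputs)
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Inside the model itself, using F.scaled_dot_product_attention lets PyTorch dispatch to a FlashAttention-style kernel when one is available, which covers the attention-related bullets.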

In practice, the most common hardware constraints are:
1. Not enough RAM or VRAM for the model itself
This shows up during model loading or immediately when moving the model to the device.
2. Not enough memory for training states
Even if the weights fit, full finetuning may fail because optimizer states and activations push memory usage much higher than inference.
3. Context length becomes too expensive
Longer context increases attention cost and KV-cache size, so a model that feels fine at 2k tokens may become impractical at 32k or 128k (see the KV-cache arithmetic after this list).
4. Batch size is too small for good throughput
Limited memory often forces tiny batches, which can hurt hardware utilization and make training very slow.
5. Weight loading causes peak-memory spikes
This is especially frustrating on constrained machines because loading may fail before the actual work even begins.
6. The kernels are not optimized
Without lower precision, fused kernels, or efficient attention implementations, training and inference can be much slower than necessary.
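To make item 3 concrete, the usual KV-cache arithmetic is sketched below; the layer, head, and dimension counts are typical values for a 7B-class model and are illustrative only.

```python
# The KV cache stores one key and one value vector per layer, per KV head, per token.
def kv_cache_gb(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                bytes_per_elem=2, batch_size=1):
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return elems * bytes_per_elem / 1024 ** 3

print(kv_cache_gb(2_048))    # ~1 GB
print(kv_cache_gb(32_768))   # ~16 GB
print(kv_cache_gb(131_072))  # ~64 GB, several times the bf16 weights of a 7B model
```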
This is why so many modern efficiency techniques target exactly these bottlenecks:
- GQA (grouped-query attention) and SWA (sliding-window attention) reduce KV-cache cost
- KV caching reduces repeated inference compute
- LoRA reduces finetuning memory
- bf16 lowers memory and often improves speed
- FlashAttention improves attention efficiency
- memory-mapped or meta-device loading reduces peak memory pressure
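LoRA is a good illustration of how these techniques map onto the memory breakdown above: freezing the base weights removes their gradients and optimizer states from the budget, and only small low-rank matrices are trained. A minimal pure-PyTorch sketch, where the rank and scaling values are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a small trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)     # no gradients or optimizer states
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        # Only lora_a and lora_b receive gradients, so the optimizer state is tiny.
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale
```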
So the practical bottleneck is usually not one abstract idea like “the model is too big.” It is more often a specific resource mismatch: too much activation memory, too much KV cache, too much optimizer state, too much peak load memory, or too little throughput for the available hardware.
In short, the main LLM bottlenecks on limited hardware are weight memory, activation memory, optimizer-state memory, KV-cache growth, checkpoint-loading peaks, and slow kernels, with context length, batch size, and precision choices often determining which of those becomes the limiting factor.