Machine Learning FAQ
What are the most common practical bottlenecks when training or running an LLM on limited hardware?
The most common practical bottlenecks on limited hardware are almost always some combination of memory and throughput.
For training, the biggest memory consumers are usually:
- model weights
- gradients and optimizer states
- activations needed for backpropagation, which scale with batch size and sequence (context) length
For inference, the main memory consumers are usually:
- model weights
- KV cache for long prompts and long outputs
- the numeric precision in which both are stored, such as fp32 versus bf16
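A quick back-of-the-envelope estimate makes the gap between the two cases concrete. The sketch below is illustrative only: the 7B parameter count, bf16 weights, and Adam optimizer are assumptions, not numbers from the repo.

```python
# Rough memory estimate for a dense transformer; all numbers are illustrative.
def estimate_memory_gb(n_params: float, bytes_per_param: int = 2) -> dict:
    gb = 1024 ** 3
    weights = n_params * bytes_per_param        # bf16 weights
    grads = n_params * bytes_per_param          # gradients (training only)
    adam_states = n_params * 4 * 2              # fp32 first and second moments
    return {
        "inference_weights_gb": weights / gb,
        "training_state_gb": (weights + grads + adam_states) / gb,
    }

# A 7B-parameter model in bf16: roughly 13 GB just for the weights at inference,
# but roughly 78 GB of weights + gradients + optimizer states for training,
# before counting activations, which grow with batch size and context length.
print(estimate_memory_gb(7e9))
```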
One useful way to separate the problem is:
Training bottlenecks
- activation memory
- optimizer-state memory
- slow attention and matrix-multiplication kernels
- small or poorly utilized batch sizes
Inference bottlenecks
- loading the model into RAM or VRAM
- KV-cache growth with context length
- sequential token generation latency
The repo includes dedicated material on both sides of this.
For model loading, one practical issue is peak memory during checkpoint loading. A naive weight-loading approach can briefly hold multiple copies of the model or state dict in memory at once, which can make loading fail even when the final model would fit.
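One common mitigation in recent PyTorch releases is to build the model on the meta device and memory-map the checkpoint, so weights are materialized directly into the model instead of being duplicated. A minimal sketch, where MyTransformer, config, and the checkpoint path are placeholders:

```python
import torch

# Placeholder model class and checkpoint path, shown only to illustrate the pattern
# (mmap=True and assign=True require a recent PyTorch 2.x release).
with torch.device("meta"):
    model = MyTransformer(config)  # parameters are shape-only, nothing allocated yet

# mmap=True maps the checkpoint file instead of reading it fully into RAM,
# and map_location="cpu" keeps the GPU out of the loading path.
state_dict = torch.load("model.pth", mmap=True, weights_only=True, map_location="cpu")

# assign=True places the loaded tensors directly into the model instead of
# copying them into separately allocated parameters, which lowers peak memory.
model.load_state_dict(state_dict, assign=True)
model.to("cuda")
```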

For runtime and training throughput, the repo’s PyTorch performance notes highlight several common bottlenecks and fixes:
- high precision instead of bf16
- non-fused optimizers
- inefficient attention implementations
- missing FlashAttention
- missing torch.compile
- poor tensor shapes or batch sizes
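Several of these fixes are only a line or two of PyTorch. The training loop below is a minimal sketch in which MyTransformer, config, and train_loader are placeholders:

```python
import torch
import torch.nn.functional as F

model = MyTransformer(config).to("cuda")   # placeholder model
model = torch.compile(model)               # fuse kernels / optimize the graph

# A fused optimizer avoids launching one small kernel per parameter tensor.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, fused=True)

for inputs, targets in train_loader:       # placeholder dataloader
    inputs, targets = inputs.to("cuda"), targets.to("cuda")
    # bf16 autocast reduces memory traffic and enables faster matmul kernels.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(inputs)
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Inside the model itself, using F.scaled_dot_product_attention lets PyTorch dispatch to a FlashAttention-style kernel when one is available, which covers the attention-related bullets.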

In practice, the most common hardware constraints are:
1. Not enough RAM or VRAM for the model itself
This shows up during model loading or immediately when moving the model to the device.
2. Not enough memory for training states
Even if the weights fit, full finetuning may fail because optimizer states and activations push memory usage much higher than inference.
3. Context length becomes too expensive
Longer context increases attention cost and KV-cache size, so a model that feels fine at 2k tokens may become impractical at 32k or 128k (see the KV-cache arithmetic after this list).
4. Batch size is too small for good throughput
Limited memory often forces tiny batches, which can hurt hardware utilization and make training very slow.
5. Weight loading causes peak-memory spikes
This is especially frustrating on constrained machines because loading may fail before the actual work even begins.
6. The kernels are not optimized
Without lower precision, fused kernels, or efficient attention implementations, training and inference can be much slower than necessary.
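To make item 3 concrete, the usual KV-cache arithmetic is sketched below; the layer, head, and dimension counts are typical values for a 7B-class model and are illustrative only.

```python
# The KV cache stores one key and one value vector per layer, per KV head, per token.
def kv_cache_gb(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                bytes_per_elem=2, batch_size=1):
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return elems * bytes_per_elem / 1024 ** 3

print(kv_cache_gb(2_048))    # ~1 GB
print(kv_cache_gb(32_768))   # ~16 GB
print(kv_cache_gb(131_072))  # ~64 GB, several times the bf16 weights of a 7B model
```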
This is why so many modern efficiency techniques target exactly these bottlenecks:
- GQA (grouped-query attention) and SWA (sliding-window attention) reduce KV-cache cost
- KV caching reduces repeated inference compute
- LoRA reduces finetuning memory
- bf16 lowers memory and often improves speed
- FlashAttention improves attention efficiency
- memory-mapped or meta-device loading reduces peak memory pressure
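LoRA is a good illustration of how these techniques map onto the memory breakdown above: freezing the base weights removes their gradients and optimizer states from the budget, and only small low-rank matrices are trained. A minimal pure-PyTorch sketch, where the rank and scaling values are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a small trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)     # no gradients or optimizer states
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        # Only lora_a and lora_b receive gradients, so the optimizer state is tiny.
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale
```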
So the practical bottleneck is usually not one abstract idea like “the model is too big.” It is more often a specific resource mismatch: too much activation memory, too much KV cache, too much optimizer state, too much peak load memory, or too little throughput for the available hardware.
In short, the main LLM bottlenecks on limited hardware are weight memory, activation memory, optimizer-state memory, KV-cache growth, checkpoint-loading peaks, and slow kernels, with context length, batch size, and precision choices often determining which of those becomes the limiting factor.