Machine Learning FAQ
Why is the KV cache such a big memory bottleneck at long context lengths?
The KV cache becomes a major memory bottleneck at long context lengths because it stores the keys and values of every token the model has already processed, and it must do so separately at every transformer layer.
That means the cache grows with:
- sequence length
- number of layers
- number of key-value heads
- batch size
- element precision, e.g. fp32 or bf16
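Those growth factors multiply together, so a rough size estimate is easy to compute. Here is a minimal sketch; the config numbers (32 layers, 32 KV heads, head dimension 128, bf16) are illustrative assumptions resembling a Llama-2-7B-style model, not taken from any specific deployment.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, batch_size, bytes_per_elem):
    """Estimate total KV-cache size in bytes.

    The factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical 7B-class config: 32 layers, 32 KV heads, head_dim 128, bf16 (2 bytes/elem)
size = kv_cache_bytes(seq_len=4096, n_layers=32, n_kv_heads=32,
                      head_dim=128, batch_size=1, bytes_per_elem=2)
print(f"{size / 2**30:.2f} GiB")  # → 2.00 GiB at 4k context, batch 1
```

Note that under these assumptions the cache already costs about 2 GiB at a 4k context, and doubling the context length or the batch size doubles it again.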
So even if the model weights stay fixed, the KV cache keeps growing as the conversation or document gets longer.

This matters especially at long context because the cache is not a tiny side structure. In large decoder-only LLMs, it can become one of the main consumers of inference memory.
That is exactly why so many modern architecture changes focus on reducing KV-cache cost rather than only reducing weight size. For example:
- GQA reduces the number of distinct key-value heads
- sliding-window attention limits how much context must remain active
- lower precision reduces bytes per stored element

So the KV cache is not just an optimization. It is both:
- a speedup mechanism, because it avoids recomputing old keys and values
- a memory burden, because all of those stored tensors have to live somewhere
At short contexts, that burden may be manageable. At long contexts, it becomes one of the main constraints on deployment.
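The speedup side of this trade-off can be made concrete with a toy decode loop: each step projects only the new token and appends its key and value to the cache, instead of re-projecting the whole prefix. This is an illustrative single-head sketch in NumPy, not a real model; all shapes and weights here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # toy model dimension
Wk = rng.normal(size=(d, d))            # key projection
Wv = rng.normal(size=(d, d))            # value projection

k_cache, v_cache = [], []               # the KV cache: one entry per processed token

def decode_step(x):
    # Project only the NEW token; all past keys/values come from the cache.
    k_cache.append(x @ Wk)
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ x / np.sqrt(d)         # attend the new token over the whole prefix
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                        # attention output for the new token

for _ in range(5):
    out = decode_step(rng.normal(size=d))
# The cache now holds 5 key vectors and 5 value vectors, one pair per token:
# that stored state is exactly the memory cost the answer above describes.
```

Without the cache, each step would recompute `x @ Wk` and `x @ Wv` for every previous token, making each decode step cost grow with the prefix; the cache trades that recomputation for the stored tensors.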
In short, the KV cache is a major memory bottleneck at long context lengths because it must store keys and values for every processed token across the whole transformer stack, and that storage grows directly with context length.