The KV cache becomes a major memory bottleneck at long context lengths because it stores the past keys and values for every token the model has already processed, and it must do so separately for every layer.

That means the cache grows with:

  • sequence length
  • number of layers
  • number of key-value heads
  • per-head dimension
  • batch size
  • bytes per element, set by the precision (e.g., fp32 vs. bf16)

So even if the model weights stay fixed, the KV cache keeps growing as the conversation or document gets longer.
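
As a rough sketch of that growth, the cache size is essentially the product of those factors. The dimensions below are hypothetical (chosen to resemble a 7B-class decoder with full multi-head KV), so treat the numbers as illustrative only:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, batch_size, bytes_per_elem):
    # Factor of 2: both keys and values are stored for every token, layer, and KV head.
    return 2 * batch_size * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical dimensions: 32 layers, 32 KV heads, head_dim 128,
# bf16 (2 bytes per element), batch size 1.
for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len, 32, 32, 128, 1, 2) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:6.2f} GiB of KV cache")
```

With these assumed dimensions the cache alone goes from about 2 GiB at 4K tokens to about 64 GiB at 128K tokens, while the weights stay the same size throughout.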

The KV-cache diagrams in the repo show how previously computed keys and values are retained so the model can avoid recomputing them at every generation step.
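
As a minimal, framework-agnostic sketch of the mechanism those diagrams depict (a single toy attention head in NumPy; the names and shapes here are illustrative, not any particular library's API):

```python
import numpy as np

head_dim = 8  # illustrative size; real models use larger per-head dimensions

# The cache simply accumulates one (key, value) pair per generated token.
k_cache, v_cache = [], []

def generate_step(new_token_hidden, w_q, w_k, w_v):
    # Compute K and V only for the NEW token and append them;
    # everything already in the cache is reused without recomputation.
    k_cache.append(new_token_hidden @ w_k)
    v_cache.append(new_token_hidden @ w_v)

    q = new_token_hidden @ w_q
    keys = np.stack(k_cache)    # (num_cached_tokens, head_dim)
    values = np.stack(v_cache)  # (num_cached_tokens, head_dim)

    # Standard scaled dot-product attention over the whole cache.
    scores = keys @ q / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values     # attention output for the new token

rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.standard_normal((head_dim, head_dim)) for _ in range(3))
for _ in range(5):  # five decoding steps; the cache grows by one entry each step
    out = generate_step(rng.standard_normal(head_dim), w_q, w_k, w_v)
print(len(k_cache), "keys cached")  # -> 5
```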

This matters especially at long context lengths because the cache is not a small side structure. In large decoder-only LLMs, it can become one of the main consumers of inference memory.

That is exactly why so many modern architecture changes focus on reducing KV-cache cost rather than only reducing weight size. For example:

  • grouped-query attention (GQA) reduces the number of distinct key-value heads
  • sliding-window attention limits how much context must remain active
  • lower precision reduces bytes per stored element

The repo’s GQA memory estimates make this scaling pressure visible: as context grows, KV-cache costs quickly dominate, and the architectural savings become significant.
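
Reusing the kv_cache_bytes helper from the sketch above (same hypothetical 7B-class dimensions; the KV-head counts here are illustrative and not taken from the repo's exact configurations), the effect of GQA on that estimate looks like this:

```python
# Same hypothetical model at a 131,072-token context, bf16, batch size 1;
# only the number of KV heads changes between the two estimates.
mha_gib = kv_cache_bytes(131_072, 32, 32, 128, 1, 2) / 2**30  # full multi-head KV
gqa_gib = kv_cache_bytes(131_072, 32, 8, 128, 1, 2) / 2**30   # GQA with 8 KV heads
print(f"MHA: {mha_gib:.0f} GiB vs GQA: {gqa_gib:.0f} GiB "
      f"({mha_gib / gqa_gib:.0f}x smaller cache)")
```

Cutting the KV heads from 32 to 8 shrinks the cache by the same 4x at every context length, which is why the savings look modest at short contexts but substantial once the cache dominates memory use.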

So the KV cache is not just an optimization. It is both:

  • a speedup mechanism, because it avoids recomputing old keys and values
  • a memory burden, because all of those stored tensors have to live somewhere

At short contexts, that burden may be manageable. At long contexts, it becomes one of the main constraints on deployment.

In short, the KV cache is a major memory bottleneck at long context lengths because it must store keys and values for every processed token across the whole transformer stack, and that storage grows linearly with context length.