Chapter 4 is where the main modern decoder variations start to matter: KV-cache compression, local attention, sparse feed-forward layers, and linear-attention hybrids.

These pages adapt the bonus chapters in temp/LLMs-from-scratch/ch04 into quick website guides, with figures and examples drawn from the corresponding chapter scripts, later LLMs-from-scratch notebooks, and the linked architecture articles.

[Figure: Hybrid Attention]

Hybrid Attention

A family of architectures that replaces most full-attention layers with cheaper linear-attention or state-space sequence modules, while keeping a small number of full-attention layers to handle precise long-range retrieval.
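
A minimal PyTorch sketch of this layer pattern, under stated assumptions: the linear-attention block uses a simple ELU-based positive feature map (in the style of Katharopoulos et al.), and the module names (FullAttention, LinearAttention, HybridStack), the full_every interval, and all hyperparameters are illustrative rather than taken from the chapter code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FullAttention(nn.Module):
    """Standard causal self-attention: quadratic cost, exact retrieval."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each to (batch, heads, time, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(b, t, d))


class LinearAttention(nn.Module):
    """Causal linear attention: a positive feature map replaces softmax, so a
    running (key x value) summary yields O(t) cost instead of O(t^2)."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1  # positive feature map
        # Prefix sums over time keep the computation causal
        kv = torch.einsum("bhtk,bhtv->bhtkv", k, v).cumsum(dim=2)
        z = k.cumsum(dim=2)  # running normalizer
        out = torch.einsum("bhtk,bhtkv->bhtv", q, kv)
        out = out / (torch.einsum("bhtk,bhtk->bht", q, z).unsqueeze(-1) + 1e-6)
        return self.proj(out.transpose(1, 2).reshape(b, t, d))


class HybridStack(nn.Module):
    """Pre-norm residual stack: linear attention everywhere except every
    `full_every`-th block, which keeps full attention for retrieval."""

    def __init__(self, d_model=256, n_heads=4, n_layers=8, full_every=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.ModuleDict({
                "norm": nn.LayerNorm(d_model),
                "attn": (FullAttention if (i + 1) % full_every == 0
                         else LinearAttention)(d_model, n_heads),
            })
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = x + layer["attn"](layer["norm"](x))
        return x


x = torch.randn(2, 128, 256)   # (batch, seq_len, d_model)
print(HybridStack()(x).shape)  # torch.Size([2, 128, 256])
```

The interleaving ratio (here three linear blocks per full-attention block) is the main design knob; published hybrids vary it widely, trading retrieval quality against the memory and compute savings of the linear layers.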