Chapter 4 Concept Guides
Chapter 4 is where the main modern decoder variations start to matter: KV-cache compression, local attention, sparse feed-forward layers, and linear-attention hybrids.
These pages turn the bonus chapters under the local temp/LLMs-from-scratch/ch04 directory into quick website guides,
with figures and examples pulled from the corresponding chapter scripts, later LLMs-from-scratch notebooks, and
linked architecture articles.
RMSNorm
A cheaper alternative to LayerNorm that keeps activation magnitudes stable without explicitly subtracting the mean.
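A minimal PyTorch sketch of the idea (module and parameter names are illustrative, not the chapter's exact code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Scale activations by their root-mean-square; no mean subtraction, no bias.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel scale

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```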
Rotary Positional Embeddings (RoPE)
Encodes token position by rotating query and key vectors so attention scores depend on relative offsets, which is why it replaced learned absolute position embeddings in most LLMs.
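A rough PyTorch sketch of applying RoPE to a query or key tensor (the rotate-half layout below is one common variant; names and shapes are illustrative):

```python
import torch

def apply_rope(x, base=10_000.0):
    # x: (batch, heads, seq_len, head_dim); rotate channel pairs by position-dependent angles
    seq_len, head_dim = x.shape[-2], x.shape[-1]
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Each (x1, x2) pair is rotated by its token's angle, so q·k depends on relative position.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```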
SiLU / Swish
The smooth activation behind modern SwiGLU feed-forward blocks, favored largely for efficiency and optimization stability.
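A small sketch of a SwiGLU feed-forward block built on SiLU (layer names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    # SiLU(x @ W_gate) acts as a smooth gate on (x @ W_up) before the down-projection.
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```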
QK-Norm
Applies RMSNorm to queries and keys inside attention before RoPE, helping stabilize large-model training.
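A sketch of where QK-Norm sits in the attention forward pass (nn.RMSNorm needs a recent PyTorch; apply_rope stands in for whatever RoPE helper the block already uses):

```python
import torch
import torch.nn as nn

batch, n_heads, seq_len, head_dim = 1, 8, 16, 64
q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_heads, seq_len, head_dim)

q_norm = nn.RMSNorm(head_dim)  # RMSNorm over the head dimension for queries
k_norm = nn.RMSNorm(head_dim)  # ... and one for keys

q, k = q_norm(q), k_norm(k)    # QK-Norm happens here, before RoPE
# q, k = apply_rope(q), apply_rope(k)   # then RoPE, then the usual attention
scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
```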
No Positional Embeddings (NoPE)
Drops explicit positional encodings in selected layers and relies on causal order plus learned behavior instead.
Grouped-Query Attention (GQA)
Multiple query heads share the same key-value projections, reducing KV-cache size without changing the overall decoder pattern.
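A minimal sketch of grouped-query attention (sizes are illustrative; the causal mask is omitted for brevity):

```python
import torch

batch, seq_len, head_dim = 1, 16, 64
n_q_heads, n_kv_heads = 8, 2              # every 4 query heads share one KV head
group_size = n_q_heads // n_kv_heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)  # only these smaller K/V tensors get cached
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K/V so each group of query heads attends against the same shared KV head.
k = k.repeat_interleave(group_size, dim=1)
v = v.repeat_interleave(group_size, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
out = scores.softmax(dim=-1) @ v
```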
Multi-Head Latent Attention (MLA)
Compresses the key-value state into a smaller latent before it goes into the KV cache, which is how DeepSeek-style models cut KV-cache memory.
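A rough sketch of the compression step (loosely following the MLA idea; names and sizes are made up, and details such as the separate RoPE path are left out):

```python
import torch
import torch.nn as nn

d_model, d_latent, n_heads, head_dim = 512, 64, 8, 64

down_kv = nn.Linear(d_model, d_latent, bias=False)          # compress the hidden state
up_k = nn.Linear(d_latent, n_heads * head_dim, bias=False)  # re-expand latent into keys
up_v = nn.Linear(d_latent, n_heads * head_dim, bias=False)  # ... and into values

x = torch.randn(1, 16, d_model)
kv_latent = down_kv(x)   # (1, 16, 64): this small tensor is what gets stored in the KV cache
k = up_k(kv_latent)      # keys/values are re-expanded on the fly at attention time
v = up_v(kv_latent)
```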
Sliding Window Attention (SWA)
Makes selected attention layers local by letting tokens see only a recent window instead of the full prefix.
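A quick sketch of the sliding-window mask itself (window size is illustrative):

```python
import torch

seq_len, window = 8, 4
i = torch.arange(seq_len)[:, None]   # query positions
j = torch.arange(seq_len)[None, :]   # key positions

# Causal, but restricted to the last `window` tokens: position i may attend to j in (i - window, i].
mask = (j <= i) & (j > i - window)
print(mask.int())
```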
Mixture of Experts (MoE)
A sparse feed-forward setup where only a few experts are active per token, even though total model capacity is much larger.
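A compact sketch of top-k expert routing (a naive loop for readability; real implementations batch tokens per expert, and all names here are illustrative):

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                               # x: (batch, seq, dim)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)               # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = idx[..., slot] == e               # tokens whose slot-th choice is expert e
                if sel.any():
                    out[sel] += weights[..., slot][sel].unsqueeze(-1) * expert(x[sel])
        return out
```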
Gated Attention
A tuned attention block that keeps content-based attention but adds an output gate and a few stabilizing tweaks.
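One way the output gate can look (a sketch assuming a sigmoid gate computed from the block input; not the exact formulation of any specific paper):

```python
import torch
import torch.nn as nn

d_model = 512
gate_proj = nn.Linear(d_model, d_model, bias=False)
out_proj = nn.Linear(d_model, d_model, bias=False)

x = torch.randn(1, 16, d_model)          # block input
attn_out = torch.randn(1, 16, d_model)   # stand-in for the concatenated attention-head output

gated = attn_out * torch.sigmoid(gate_proj(x))   # per-channel sigmoid gate on the attention output
out = out_proj(gated)                            # then the usual output projection
```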
Hybrid Attention
A broader family of architectures that replaces most full-attention layers with cheaper linear or state-space sequence modules while keeping a smaller number of heavier retrieval layers.
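A schematic of the layer mix (the ratio below is purely illustrative):

```python
# Mostly cheap linear/state-space mixer blocks, with a full-attention block kept every few layers.
n_layers, full_every = 32, 4
layer_types = [
    "full_attention" if (i + 1) % full_every == 0 else "linear_mixer"
    for i in range(n_layers)
]
```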
DeepSeek Sparse Attention
A learned sparse attention mask used to reduce long-context costs in DeepSeek V3.2- and GLM-5-style models.
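A toy sketch of the top-k selection idea (the published design uses a small learned indexer to produce the selection scores; here they are faked with plain q·k logits just to show the masking step):

```python
import torch

seq_len, head_dim, k_keep = 16, 64, 4

q = torch.randn(1, seq_len, head_dim)
k = torch.randn(1, seq_len, head_dim)

# Score every (query, past-token) pair, keeping causality.
index_scores = q @ k.transpose(-2, -1)
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
index_scores = index_scores.masked_fill(~causal, float("-inf"))

# Keep only the top-k past tokens per query; everything else is dropped from attention.
topk_idx = index_scores.topk(k_keep, dim=-1).indices
sparse_mask = torch.zeros(1, seq_len, seq_len, dtype=torch.bool).scatter(
    -1, topk_idx, torch.ones_like(topk_idx, dtype=torch.bool)
)
sparse_mask &= causal   # early positions have fewer than k_keep valid tokens; never un-mask the future
```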