A practical reference for the most important concepts in modern large language models (LLMs). Each entry includes a concise definition and links to the best explanations and resources on this site.

Core Transformer Concepts

# Causal Attention
A self-attention setup where each token can attend only to itself and earlier tokens. This masking is what lets decoder-only LLMs generate text autoregressively.
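As a minimal sketch of the masking rule (the helper name `causal_mask` is illustrative, not taken from any library), the allowed query/key pairs form a lower-triangular matrix:

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len boolean matrix.

    Entry [i][j] is True when query position i may attend to key position j,
    which under causal masking means j <= i.
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    # "x" marks an allowed position, "." a masked one.
    print(" ".join("x" if allowed else "." for allowed in row))
```

In a real attention layer, the masked positions are typically set to negative infinity before the softmax so they receive zero weight.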
# Multi-Head Attention (MHA)
The original transformer attention mechanism in which the model computes multiple attention 'heads' in parallel, each with its own learned projections for queries, keys, and values.
# QK-Norm
Normalization applied to the query and key vectors before the attention dot product. It improves training stability, especially in very large models.
# Root Mean Square Layer Normalization (RMSNorm)
A simplified, cheaper variant of LayerNorm that normalizes by the root mean square of the inputs only, skipping the mean-centering step. Widely adopted in Llama, Mistral, and many other modern models.
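The computation can be sketched in a few lines of plain Python. The function name, the `eps` value, and the example vector are assumptions made for illustration:

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-feature weight.

    eps is a small constant (assumed value) that guards against division by zero.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


normed = rms_norm([1.0, 2.0, 3.0, 4.0], weight=[1.0] * 4)
# With unit weights, the output has a root mean square of roughly 1.
print(math.sqrt(sum(v * v for v in normed) / len(normed)))
```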
# SwiGLU
A gated feed-forward block in which one linear projection passes through a SiLU (Swish) activation and multiplies a second linear projection elementwise. It replaced older plain GELU MLP blocks in many modern LLM architectures.
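A rough sketch of the block on a single token vector, assuming three bias-free weight matrices (here called `w_gate`, `w_up`, and `w_down`; the names and the tiny sizes are illustrative):

```python
import math


def silu(z):
    """SiLU (Swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(w_down, hidden)


# Toy sizes: model width 2, hidden width 3.
w_gate = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
w_up = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_down = [[0.2, 0.1, -0.3], [0.0, 0.4, 0.6]]
y = swiglu_ffn([1.0, 2.0], w_gate, w_up, w_down)
print(y)
```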

Attention and Context Efficiency

# Compressed Sparse Attention / Heavily Compressed Attention
Long-context attention variants that compress older context along the sequence dimension so the cache or attention map has fewer effective entries.
# Context Length
The maximum number of tokens a model can condition on at once. Larger contexts can handle longer documents but increase memory and compute costs.
# Cross-Layer KV Sharing
An efficiency technique where some layers reuse key and value tensors from earlier layers, reducing how many separate KV caches grow with context length.
# DeepSeek Sparse Attention
A sparse attention mechanism that selects a subset of previous tokens to reduce long-context attention cost in DeepSeek V3.2- and GLM-5-style models.
# Grouped-Query Attention (GQA)
An attention mechanism where multiple query heads share a smaller set of key and value heads. It significantly reduces KV cache memory during inference while preserving most of the quality of standard multi-head attention.
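To make the memory saving concrete, here is a back-of-the-envelope calculation for a single sequence. The model shape (32 layers, 32 query heads, head dimension 128, 8192 tokens, 2 bytes per value) is a hypothetical configuration chosen for round numbers, not a specific model:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


config = dict(num_layers=32, head_dim=128, seq_len=8192)
for name, kv_heads in [("MHA (32 KV heads)", 32), ("GQA (8 KV heads)", 8), ("MQA (1 KV head)", 1)]:
    gib = kv_cache_bytes(num_kv_heads=kv_heads, **config) / 2**30
    print(f"{name}: {gib:.3f} GiB")
```

Under these assumptions the cache shrinks in proportion to the number of key/value heads: 4 GiB, 1 GiB, and 0.125 GiB respectively.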
# KV Cache
A memory buffer that stores previously computed key and value vectors during autoregressive text generation, avoiding redundant computation and dramatically speeding up inference.
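The idea can be sketched with a toy single-head decoder step. The `KVCache` class and `decode_step` helper are illustrative names, and the vectors are made up:

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all cached keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


class KVCache:
    """Stores the keys and values of all tokens processed so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(cache, query, key, value):
    """Process one new token: only its own key/value are computed and appended."""
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)


cache = KVCache()
first = decode_step(cache, [1.0, 0.0], [1.0, 0.0], [0.5, -0.5])
second = decode_step(cache, [0.0, 1.0], [0.0, 1.0], [2.0, 2.0])
print(first, second, len(cache.keys))
```

Without the cache, every step would recompute keys and values for the whole prefix; with it, each step adds one entry and reuses the rest.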
# Multi-Head Latent Attention (MLA)
An efficient attention variant (used in DeepSeek-V2 and later models) that compresses the key-value cache using low-rank projections, achieving strong performance with much lower memory usage than standard attention.
# Multi-Query Attention (MQA)
An attention variant where all query heads share a single key head and a single value head. It reduces KV cache memory more aggressively than GQA, often with a larger quality tradeoff.
# Sliding Window Attention (SWA)
An attention pattern where each token can only attend to a fixed-size local window of previous tokens (plus global tokens in some variants). Used in models like Mistral and Gemma to reduce compute.
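A small sketch of the local-window rule, ignoring the optional global tokens. The helper name is illustrative, and `window` is assumed to count the current token itself:

```python
def sliding_window_mask(seq_len, window):
    """Boolean mask where query i may attend to keys j with i - window < j <= i."""
    return [[(i - window < j <= i) for j in range(seq_len)] for i in range(seq_len)]


for row in sliding_window_mask(seq_len=5, window=2):
    # "x" marks an allowed position, "." a masked one.
    print(" ".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so the attention cost per token stays constant as the sequence grows.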

Architecture Variants

# Gated Attention
An attention block that keeps content-based attention but adds gating and stabilizing tweaks so the model can modulate attention output more flexibly.
# Hybrid Attention
A broad architecture pattern that mixes full attention with cheaper sequence modules such as linear attention or state-space layers, aiming to reduce long-context cost while preserving retrieval ability.
# Latent MoE
A mixture-of-experts design that performs expert routing in a lower-dimensional latent space before projecting back to the model width.
# Layer-Wise Attention Budgeting
An architecture choice that varies attention capacity across layer types, such as assigning different query-head counts to global and sliding-window layers.
# Mixture of Experts (MoE)
An architecture that routes each token to a subset of specialized 'expert' feed-forward networks. This allows models to have a very large total parameter count while keeping the active parameter count (and compute) much smaller.
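A minimal sketch of top-k routing for one token. It assumes the router weights come from a softmax over the selected logits only (real models differ in how they normalize), and the "experts" are stand-in scaling functions rather than real feed-forward networks:

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their logits."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for index, weight in top_k_route(router_logits, k):
        expert_out = experts[index](x)  # the other experts are never evaluated
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Hypothetical experts: each just scales its input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
routes = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
output = moe_layer([1.0, 1.0], [0.1, 2.0, -1.0, 2.0], experts, k=2)
print(routes, output)
```

Only 2 of the 4 experts run for this token, which is why the active parameter count is much smaller than the total.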
# No Positional Embeddings (NoPE)
A design choice where selected attention layers omit explicit positional embeddings. Recent models usually use it selectively rather than removing positional information everywhere.
# Per-Layer Embeddings (PLE)
Small layer-specific token vectors added inside the model so edge-scale architectures can gain embedding capacity without widening the full transformer path.

Tokenization and Embeddings

# Byte Pair Encoding (BPE)
A subword tokenization algorithm that starts with individual characters (or bytes, in byte-level variants) and iteratively merges the most frequent adjacent pair into a new token. It is the foundation of most modern LLM tokenizers (GPT, Llama, etc.).
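The merge loop can be sketched on a toy word-frequency table. The corpus and helper names below are made up for illustration, and the sketch works on characters rather than bytes:

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def learn_merges(words, num_merges):
    """Run the merge loop a fixed number of times and record each merge."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


# Toy corpus: each word split into characters, mapped to its frequency.
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12, tuple("bun"): 4, tuple("hugs"): 5}
merges, vocab_words = learn_merges(corpus, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

The sketch assumes at least one pair remains at every step; a production tokenizer would also handle ties, empty input, and pre-tokenization.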
# Rotary Positional Embeddings (RoPE)
A positional encoding technique that rotates query and key vectors by an angle that depends on token position, so attention scores depend on the relative offset between tokens. It is commonly paired with long-context scaling methods in modern LLMs.
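The relative-position property can be checked numerically with a small sketch. It assumes the interleaved pairing of dimensions (0 with 1, 2 with 3, and so on) and the common base of 10000; implementations vary in how they pair dimensions:

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same offset (2) at different absolute positions gives the same score.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 12), rope(k, 10))
print(near, far)
```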

Training and Finetuning

# Base Model
A model after pretraining but before instruction tuning or alignment. It is good at next-token prediction but not necessarily good at following user instructions.
# Instruct Model
A model tuned on instruction-response examples so it follows prompts, formats answers more usefully, and behaves more like an assistant.
# Instruction Finetuning (SFT)
The supervised fine-tuning stage after pretraining where the model is trained on instruction-response pairs to follow user instructions and produce helpful outputs.
# LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that freezes the original model weights and learns small low-rank update matrices instead. It greatly reduces memory and storage requirements for adaptation.
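A sketch of the adapted forward pass and the parameter savings. The matrix names (`lora_a`, `lora_b`), the `alpha / rank` scaling convention, and the 4096-wide layer are illustrative assumptions:

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_linear(x, w_frozen, lora_a, lora_b, alpha):
    """Compute W x + (alpha / rank) * B (A x), where only A and B are trained."""
    rank = len(lora_a)  # A has shape (rank, d_in), B has shape (d_out, rank)
    base = matvec(w_frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, rank):
    """Number of parameters in the two low-rank matrices."""
    return rank * (d_in + d_out)


# With B initialized to zeros, the adapted layer starts out identical to the frozen one.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[0.5, 0.5]]
b = [[0.0], [0.0]]
print(lora_linear([3.0, 4.0], w, a, b, alpha=2.0))

full = 4096 * 4096
low_rank = trainable_params(4096, 4096, rank=8)
print(full, low_rank, full // low_rank)
```

For this hypothetical 4096 x 4096 layer at rank 8, the adapter trains 65,536 parameters instead of about 16.8 million, a 256-fold reduction.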
# Pretraining
The initial large-scale training phase where a model learns general language patterns by predicting the next token on massive unlabeled text corpora.
# Prompt Template
The fixed formatting pattern around user instructions, system messages, and responses. It matters because finetuning teaches the model to follow a specific token pattern.

Post-Training and Reasoning

# DPO (Direct Preference Optimization)
A simpler and often more stable alternative to reward-model-based RLHF for aligning language models with human preferences. It directly optimizes the model using a classification-style loss on pairs of preferred and rejected responses.
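The loss for a single preference pair can be sketched as follows. The inputs are summed log-probabilities of each response under the policy being trained and under a frozen reference model; the argument names and the `beta` value are assumptions for illustration:

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Negative log-sigmoid of the scaled difference in log-probability margins."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)); fine for moderate z in this sketch.
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: the loss is log(2), like an untrained classifier.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy now prefers the chosen response more than the reference does: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(start, improved)
```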
# Distillation
A training approach where a smaller or cheaper model learns from outputs produced by a stronger teacher model, often to transfer reasoning behavior or style.
# GRPO (Group Relative Policy Optimization)
A reinforcement learning algorithm used in reasoning model training (notably in DeepSeek-R1) that estimates advantages by comparing groups of responses instead of using a separate value model.
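The group-based advantage estimate can be sketched as a simple standardization of rewards within one group of sampled responses. This sketch assumes the population standard deviation and a small `eps`; it leaves out the policy-gradient update itself:

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to the same prompt; 1.0 means the answer was correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat the group average get a positive advantage and the rest get a negative one, with no learned value model involved.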
# Inference-Time Scaling
A family of methods that spend more computation during generation, rather than only during training, to improve reasoning or answer quality.
# RLHF (Reinforcement Learning from Human Feedback)
A post-training technique that uses human preference data and reinforcement learning (usually with a reward model) to align model outputs with human values and preferences.
# RLVR (Reinforcement Learning with Verifiable Rewards)
A post-training approach where the reward signal comes from automatically checkable answers, such as math results, code tests, or symbolic verifiers, instead of a learned reward model.
# Reasoning Model
A large language model trained or prompted to spend more inference compute on multi-step problems, often using reinforcement learning and inference-time scaling techniques.

Inference and Decoding

# Temperature
A decoding control that divides the next-token logits by a constant before the softmax, which sharpens or flattens the resulting probabilities. Lower values make outputs more deterministic, while higher values increase randomness.
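A minimal sketch of the effect, using made-up logits for three candidate tokens:

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
print(sharp)  # most mass on the top token
print(flat)   # mass spread more evenly
```

A temperature of exactly 0 is usually treated as greedy decoding (always pick the most likely token) rather than passed through this formula.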
# Tool Use
A model behavior where the LLM calls an external tool, uses the returned result, and incorporates that result into its response.
# Top-k Sampling
A decoding method that samples only from the k most likely next tokens, discarding the rest of the probability distribution.
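As a sketch, the filter keeps the k largest probabilities and renormalizes them before sampling. Token ids are list indices here, and the distribution is invented:

```python
import random


def top_k_filter(probs, k):
    """Keep the k most likely token ids and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample_top_k(probs, k, rng=random):
    """Draw one token id from the truncated, renormalized distribution."""
    dist = top_k_filter(probs, k)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_k_filter(probs, k=2)
print(filtered)
print(sample_top_k(probs, k=2, rng=random.Random(0)))
```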
# Top-p Sampling
A decoding method that samples from the smallest set of likely next tokens whose cumulative probability reaches a chosen threshold p.
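A sketch of the filtering step (also known as nucleus sampling), using list indices as token ids and an invented distribution. Unlike a fixed k, the number of kept tokens adapts to how peaked the distribution is:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


peaked = top_p_filter([0.92, 0.05, 0.02, 0.01], p=0.9)
spread = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)
print(sorted(peaked), sorted(spread))
```

Sampling then proceeds from the renormalized probabilities in the returned dictionary.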

Evaluation

# LLM-as-a-Judge
An evaluation setup where another language model scores, compares, or critiques generated answers, especially when exact string matching is too brittle.
# Perplexity
The exponential of average cross-entropy loss on a dataset. It measures next-token prediction quality, not general helpfulness or alignment.
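In code, the definition reduces to a couple of lines. The input is assumed to be the natural-log probability the model assigned to each correct token:

```python
import math


def perplexity(token_log_probs):
    """Exponential of the average negative log-likelihood per token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 1/4 to every correct token has perplexity 4:
# it is as uncertain as a uniform choice among four options.
uniform = perplexity([math.log(0.25)] * 10)
print(uniform)
```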
# Verifier
A program, rule, test suite, or scoring function that checks whether an answer is correct. Verifiers are central to many RLVR and reasoning-evaluation workflows.
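A very simple rule-based verifier for numeric answers might look like the sketch below. It assumes the final number in the response is the answer, which is a simplifying convention invented for this example; real verifiers usually require a stricter output format:

```python
import re


def extract_final_number(text):
    """Return the last number that appears in the text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


def verify_numeric_answer(text, expected, tol=1e-9):
    """True when the extracted answer matches the expected value within tol."""
    value = extract_final_number(text)
    return value is not None and abs(value - expected) <= tol


print(verify_numeric_answer("6 times 7 is 42, so the answer is 42.", 42))
print(verify_numeric_answer("I am not sure.", 42))
```

In an RLVR setup, the boolean result would typically be converted into a reward of 1 or 0.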

Systems and Hardware

# FlashAttention
A memory-efficient exact attention algorithm that uses tiling and kernel fusion to avoid materializing the full attention matrix in GPU high-bandwidth memory, leading to major speed and memory improvements during training and inference.
# Mixed Precision
A training or inference strategy that uses lower-precision number formats where possible to reduce memory use and improve hardware throughput.
# bfloat16
A 16-bit floating-point format with a wide exponent range. It is often easier to use for LLM training than float16 because it is less prone to overflow.
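The range-versus-precision tradeoff can be illustrated with the standard library. This sketch emulates bfloat16 by keeping the top 16 bits of the float32 encoding; truncation is a simplification, since real hardware conversions typically round to nearest:

```python
import struct


def to_bfloat16(x):
    """Emulate bfloat16 by zeroing the low 16 bits of the float32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# 70000 exceeds the largest finite float16 value (65504) but fits in bfloat16,
# at the cost of coarse precision: only about 3 significant decimal digits survive.
print(to_bfloat16(70000.0))
print(to_bfloat16(3.14159265))
```

bfloat16 keeps the 8 exponent bits of float32 and only 7 explicit mantissa bits, whereas float16 has 5 exponent bits and 10 mantissa bits.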
# torch.compile
A PyTorch feature that can speed up repeated workloads by reducing Python overhead and enabling graph-level kernel optimizations.