LLM Glossary
A practical reference for the most important concepts in modern large language models (LLMs). Each entry includes a concise definition and links to the best explanations and resources on this site.
Core Transformer Concepts
#
Causal Attention
A self-attention setup where each token can attend only to itself and earlier tokens. This masking is what lets decoder-only LLMs generate text autoregressively.
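As a minimal sketch of the masking idea, the following pure-Python function applies a causal mask to a toy score matrix before a row-wise softmax. The function name and the all-zero scores are illustrative assumptions, not taken from any particular library.

```python
import math

def causal_attention_weights(scores):
    """Mask future positions in a square score matrix, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend to positions 0..i; later positions get -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        row_max = max(masked)
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]  # toy scores, all equal
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

With equal scores, row i spreads its weight evenly over the first i + 1 positions and assigns zero weight to every later position.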
#
Multi-Head Attention (MHA)
The original transformer attention mechanism in which the model computes multiple attention 'heads' in parallel, each with its own learned projections for queries, keys, and values.
#
QK-Norm
Normalization applied to the query and key vectors before the attention dot product. It improves training stability, especially in very large models.
#
Root Mean Square Layer Normalization (RMSNorm)
A simplified and cheaper variant of LayerNorm that skips mean subtraction and rescales inputs using only their root mean square. Widely adopted in Llama, Mistral, and many other modern models.
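A small sketch of the computation for a single vector, assuming a learned per-feature scale (`weight`) and a small `eps` for numerical safety. Both names are illustrative.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a per-feature scale."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    # No mean subtraction and no bias term, unlike standard LayerNorm.
    return [w * v * inv_rms for v, w in zip(x, weight)]

x = [3.0, -4.0]
y = rms_norm(x, weight=[1.0, 1.0])
print(y)
```

After normalization with unit weights, the output vector has a root mean square of approximately 1.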
#
SwiGLU
A gated feed-forward block in which a SiLU (Swish) activated projection is multiplied elementwise with a second linear projection. It replaced older plain GELU MLP blocks in many modern LLM architectures.
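The gating structure can be sketched for one token vector as follows. The weight names (`W_gate`, `W_up`, `W_down`) and the tiny matrices are assumptions for illustration, and biases are omitted, as is common but not universal.

```python
import math

def silu(z):
    """SiLU (Swish with beta = 1): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(W_gate, x)]   # activated gate branch
    up = matvec(W_up, x)                          # linear "up" branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(W_down, hidden)                 # project back to model width

x = [1.0, 2.0]
W_gate = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W_up = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_down = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
out = swiglu_ffn(x, W_gate, W_up, W_down)
print(out)
```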
Attention and Context Efficiency
#
Compressed Sparse Attention / Heavily Compressed Attention
Long-context attention variants that compress older context along the sequence dimension so the cache or attention map has fewer effective entries.
#
Context Length
The maximum number of tokens a model can condition on at once. Larger contexts can handle longer documents but increase memory and compute costs.
#
Cross-Layer KV Sharing
An efficiency technique where some layers reuse key and value tensors from earlier layers, reducing how many separate KV caches grow with context length.
#
DeepSeek Sparse Attention
A sparse attention mechanism that selects a subset of previous tokens to reduce long-context attention cost in DeepSeek V3.2- and GLM-5-style models.
#
Grouped-Query Attention (GQA)
An attention mechanism where multiple query heads share a smaller set of key and value heads. It significantly reduces KV cache memory during inference while preserving most of the quality of standard multi-head attention.
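To make the sharing pattern concrete, this sketch maps each query head to the key/value head it reads from, assuming query heads are grouped contiguously (a common convention, but an assumption here). The head counts are illustrative.

```python
def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    """Return the index of the KV head shared by the given query head."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be divisible by n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# 8 query heads sharing 2 KV heads: groups of 4.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(mapping)
```

Setting `n_kv_heads` equal to `n_query_heads` recovers standard multi-head attention, and setting it to 1 gives multi-query attention.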
#
KV Cache
A memory buffer that stores previously computed key and value vectors during autoregressive text generation, avoiding redundant computation and dramatically speeding up inference.
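A back-of-the-envelope estimate shows why the cache matters at long context. The configuration below (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical example, not the specification of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size for a standard per-layer key/value cache."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 2**30:.2f} GiB")
```

The estimate grows linearly with sequence length, batch size, and the number of KV heads, which is why techniques that reduce KV heads or compress the cache target these factors.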
#
Multi-Head Latent Attention (MLA)
An efficient attention variant (used in DeepSeek-V2 and later models) that compresses the key-value cache using low-rank projections, achieving strong performance with much lower memory usage than standard attention.
#
Multi-Query Attention (MQA)
An attention variant where all query heads share one set of key and value heads. It reduces KV cache memory more aggressively than GQA, often with a larger quality tradeoff.
#
Sliding Window Attention (SWA)
An attention pattern where each token can only attend to a fixed-size local window of previous tokens (plus global tokens in some variants). Used in models like Mistral and Gemma to reduce compute.
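The local pattern can be sketched as a boolean mask. This version assumes the window counts the current token and is combined with causal masking; conventions differ between implementations, and global tokens are not modeled.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j."""
    return [
        [(i - window < j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so per-token attention cost stays constant as the sequence grows.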
Architecture Variants
#
Gated Attention
An attention block that keeps content-based attention but adds gating and stabilizing tweaks so the model can modulate attention output more flexibly.
#
Hybrid Attention
A broad architecture pattern that mixes full attention with cheaper sequence modules such as linear attention or state-space layers, aiming to reduce long-context cost while preserving retrieval ability.
#
Latent MoE
A mixture-of-experts design that performs expert routing in a lower-dimensional latent space before projecting back to the model width.
#
Layer-Wise Attention Budgeting
An architecture choice that varies attention capacity across layer types, such as assigning different query-head counts to global and sliding-window layers.
#
Mixture of Experts (MoE)
An architecture that routes each token to a subset of specialized 'expert' feed-forward networks. This allows models to have a very large total parameter count while keeping the active parameter count (and compute) much smaller.
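A minimal sketch of the routing step for one token, assuming top-k selection followed by a softmax over the selected router logits. That normalization is one common choice, and real routers add details such as load balancing that are not shown here.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and normalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    # Only these k experts run for this token; the others are skipped.
    return [(i, e / total) for i, e in zip(top, exps)]

routes = route_top_k([0.1, 2.0, -1.0, 2.0], k=2)
print(routes)
```

The token's output would then be the weighted sum of the selected experts' outputs, so compute scales with `k` rather than with the total number of experts.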
#
No Positional Embeddings (NoPE)
A design choice where selected attention layers omit explicit positional embeddings. Recent models usually use it selectively rather than removing positional information everywhere.
#
Per-Layer Embeddings (PLE)
Small layer-specific token vectors added inside the model so edge-scale architectures can gain embedding capacity without widening the full transformer path.
Tokenization and Embeddings
#
Byte Pair Encoding (BPE)
A subword tokenization algorithm that starts from individual characters or bytes and iteratively merges the most frequent adjacent pair into a new token. It is the foundation of most modern LLM tokenizers (GPT, Llama, etc.).
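One merge step of the training loop can be sketched as follows. The toy word-frequency table is invented for illustration, and production tokenizers add pre-tokenization and byte-level handling that are omitted here.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is a tuple of single-character symbols with a count.
words = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
         tuple("bun"): 4, tuple("hugs"): 5}
pair = most_frequent_pair(words)
words = merge_pair(words, pair)
print(pair, words)
```

Repeating these two steps until the vocabulary reaches a target size yields the ordered merge list that the tokenizer later applies to new text.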
#
Rotary Positional Embeddings (RoPE)
A positional encoding technique that rotates query and key vectors by angles proportional to token position, so attention scores depend on the relative offset between tokens. It is commonly paired with long-context scaling methods in modern LLMs.
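The sketch below rotates adjacent feature pairs and checks that the query-key dot product depends only on the position offset. It assumes the adjacent-pair layout and a base of 10000; many implementations instead pair the first and second halves of the vector, which is equivalent up to a permutation of features.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each (even, odd) feature pair by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]
score_a = dot(rope(q, 5), rope(k, 2))    # offset 3
score_b = dot(rope(q, 13), rope(k, 10))  # offset 3, shifted positions
print(score_a, score_b)
```

Both scores agree up to floating-point error because rotating the query by position m and the key by position n leaves only the difference m - n in the dot product.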
Training and Finetuning
#
Base Model
A model after pretraining but before instruction tuning or alignment. It is good at next-token prediction but not necessarily good at following user instructions.
#
Instruct Model
A model tuned on instruction-response examples so it follows prompts, formats answers more usefully, and behaves more like an assistant.
#
Instruction Finetuning (SFT)
The supervised finetuning stage after pretraining where the model is trained on instruction-response pairs to follow user instructions and produce helpful outputs.
#
LoRA (Low-Rank Adaptation)
A parameter-efficient finetuning method that freezes the original model weights and learns small low-rank update matrices instead. It greatly reduces memory and storage requirements for adaptation.
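A small sketch of the forward pass and the parameter savings, assuming the usual formulation where the update is `B @ A` scaled by `alpha / rank`. The matrix shapes and values are illustrative.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def lora_forward(x, W, A, B, alpha, rank):
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # trainable low-rank path: B @ (A @ x)
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_counts(d_in, d_out, rank):
    """Return (full weight parameters, LoRA parameters) for one layer."""
    return d_in * d_out, rank * (d_in + d_out)

W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]      # rank x d_in
B = [[0.0], [0.0]]    # d_out x rank, zero-initialized
out = lora_forward([1.0, 1.0], W, A, B, alpha=2.0, rank=1)
full_params, lora_params = lora_param_counts(4096, 4096, rank=8)
print(out, full_params, lora_params)
```

With `B` initialized to zeros, the adapted layer starts out identical to the frozen layer, and for a 4096 x 4096 weight at rank 8 the trainable parameter count is 256 times smaller than the full matrix.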
#
Pretraining
The initial large-scale training phase where a model learns general language patterns by predicting the next token on massive unlabeled text corpora.
#
Prompt Template
The fixed formatting pattern around user instructions, system messages, and responses. It matters because finetuning teaches the model to follow a specific token pattern.
Post-Training and Reasoning
#
DPO (Direct Preference Optimization)
A simpler and often more stable alternative to RL-based RLHF pipelines for aligning language models with human preferences. It optimizes the model directly with a classification-style loss on preference pairs, without training a separate reward model.
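The classification-style loss for a single preference pair can be sketched as below. The inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference model, and `beta = 0.1` is an illustrative value.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Negative log-sigmoid of the scaled log-ratio margin for one pair."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) = softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy equals reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)   # chosen up, rejected down
print(at_init, improved)
```

When the policy matches the reference, the loss equals log 2, and it decreases as the policy raises the chosen response's likelihood relative to the rejected one.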
#
Distillation
A training approach where a smaller or cheaper model learns from outputs produced by a stronger teacher model, often to transfer reasoning behavior or style.
#
GRPO (Group Relative Policy Optimization)
A reinforcement learning algorithm used in reasoning model training (notably in DeepSeek-R1) that estimates advantages by comparing groups of responses instead of using a separate value model.
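The group-based advantage estimate can be sketched as a per-group standardization of rewards. This assumes the population standard deviation and a small `eps`; the rest of the algorithm (clipped policy update, KL regularization) is not shown.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    # Responses better than the group average get positive advantages.
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the baseline is the group mean, no separate value model is needed to decide which responses were better than expected.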
#
Inference-Time Scaling
A family of methods that spend more computation during generation, rather than only during training, to improve reasoning or answer quality.
#
RLHF (Reinforcement Learning from Human Feedback)
A post-training technique that uses human preference data and reinforcement learning (usually with a reward model) to align model outputs with human values and preferences.
#
RLVR (Reinforcement Learning with Verifiable Rewards)
A post-training approach where the reward signal comes from automatically checkable answers, such as math results, code tests, or symbolic verifiers, instead of a learned reward model.
#
Reasoning Model
A large language model trained or prompted to spend more inference compute on multi-step problems, often using reinforcement learning and inference-time scaling techniques.
Inference and Decoding
#
Temperature
A decoding control that divides the next-token logits by a scalar before the softmax. Values below 1 sharpen the distribution and make outputs more deterministic, while values above 1 flatten it and increase randomness.
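A minimal sketch of the effect on a toy distribution, assuming temperature is applied by dividing the logits before the softmax. The logits are invented for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
baseline = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
print(sharp, baseline, flat)
```

The top token's probability is highest at temperature 0.5 and lowest at temperature 2.0, while the ranking of tokens stays the same.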
#
Tool Use
A model behavior where the LLM calls an external tool, uses the returned result, and incorporates that result into its response.
#
Top-k Sampling
A decoding method that samples only from the k most likely next tokens, discarding the rest of the probability distribution.
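A short sketch of the filtering and sampling steps on a toy probability vector. The helper names are illustrative, and real decoders usually operate on logits rather than probabilities.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(filtered, rng=random):
    """Draw one token id from the filtered distribution."""
    token_ids = list(filtered)
    return rng.choices(token_ids, weights=[filtered[i] for i in token_ids])[0]

filtered = top_k_filter([0.5, 0.3, 0.15, 0.05], k=2)
print(filtered, sample(filtered))
```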
#
Top-p Sampling
A decoding method that samples from the smallest set of likely next tokens whose cumulative probability reaches a chosen threshold p.
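The nucleus selection can be sketched as a cumulative sum over tokens sorted by probability. The toy probabilities and the threshold of 0.9 are illustrative.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / cumulative for i in keep}

nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)
print(nucleus)
```

Unlike top-k, the number of kept tokens adapts to the shape of the distribution: a confident distribution keeps few tokens, a flat one keeps many.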
Evaluation
#
LLM-as-a-Judge
An evaluation setup where another language model scores, compares, or critiques generated answers, especially when exact string matching is too brittle.
#
Perplexity
The exponential of average cross-entropy loss on a dataset. It measures next-token prediction quality, not general helpfulness or alignment.
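As a worked example, the function below computes perplexity from the probabilities a model assigned to the correct next tokens, assuming natural-log cross-entropy. The probabilities are invented.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the target tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(ppl)
```

A model that assigns probability 0.25 to every correct token has a perplexity of about 4, which can be read as being as uncertain as a uniform choice among four options.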
#
Verifier
A program, rule, test suite, or scoring function that checks whether an answer is correct. Verifiers are central to many RLVR and reasoning-evaluation workflows.
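A minimal sketch of a rule-based verifier that returns a binary reward. It assumes a hypothetical output convention in which the final numeric answer follows the marker `Answer:`; real workflows use more robust parsing or test suites.

```python
def verify_final_answer(response, expected, tol=1e-6):
    """Return 1.0 if the number after 'Answer:' matches expected, else 0.0."""
    marker = "Answer:"  # hypothetical formatting convention
    if marker not in response:
        return 0.0
    candidate = response.rsplit(marker, 1)[1].strip()
    try:
        return 1.0 if abs(float(candidate) - expected) <= tol else 0.0
    except ValueError:
        # The text after the marker was not a parseable number.
        return 0.0

print(verify_final_answer("2 + 2 is four. Answer: 4", expected=4.0))
print(verify_final_answer("I think it is five. Answer: 5", expected=4.0))
```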
Systems and Hardware
#
FlashAttention
A memory-efficient exact attention algorithm that uses tiling and kernel fusion to avoid materializing the full attention matrix in GPU high-bandwidth memory (HBM), leading to major speed and memory improvements during training and inference.
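The tiling idea rests on a streaming ("online") softmax that processes scores block by block while keeping only a running maximum, a running denominator, and a running weighted sum. The sketch below shows that arithmetic for a single query with scalar values; it is a CPU illustration of the math under those simplifying assumptions, not the fused GPU kernel itself.

```python
import math

def attention_row_reference(scores, values):
    """Standard softmax-weighted sum using the full score row."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)

def attention_row_streaming(scores, values, block_size):
    """Same result, computed one block at a time with running statistics."""
    running_max, denom, acc = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial sums to the new maximum (0.0 on first block).
        scale = math.exp(running_max - new_max)
        w_blk = [math.exp(s - new_max) for s in s_blk]
        denom = denom * scale + sum(w_blk)
        acc = acc * scale + sum(w * v for w, v in zip(w_blk, v_blk))
        running_max = new_max
    return acc / denom

scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(attention_row_reference(scores, values),
      attention_row_streaming(scores, values, block_size=2))
```

Because the streaming version is algebraically identical to the reference, the result is exact rather than an approximation, which matches the "exact attention" claim in the definition.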
#
Mixed Precision
A training or inference strategy that uses lower-precision number formats where possible to reduce memory use and improve hardware throughput.
#
bfloat16
A 16-bit floating-point format that keeps the 8-bit exponent of float32 but only 7 mantissa bits. It is often easier to use for LLM training than float16 because its wider range makes it less prone to overflow.
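To illustrate the range-versus-precision tradeoff, this sketch emulates bfloat16 by keeping the top 16 bits of a float32 with round-to-nearest-even. It is a simplified emulation that ignores NaN handling, written only with the standard library.

```python
import struct

def to_bfloat16(x):
    """Round a float to the nearest bfloat16 value, returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 bits that will be discarded.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

print(to_bfloat16(70000.0))     # representable range exceeds float16's 65504
print(to_bfloat16(1.00390625))  # 1 + 2**-8 is too fine for 7 mantissa bits
```

The first value stays finite but is rounded coarsely, while the second collapses to 1.0, which shows that bfloat16 trades precision for float32-like range.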
#
torch.compile
A PyTorch feature that can speed up repeated workloads by reducing Python overhead and enabling graph-level kernel optimizations.