LLM Glossary

Question 1

Grouped-Query Attention (GQA)

Accepted Answer

An attention mechanism where multiple query heads share a smaller set of key and value heads. It significantly reduces KV cache memory during inference while preserving most of the quality of standard multi-head attention.
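
The sharing pattern can be sketched as a mapping from query heads to KV heads. This is a minimal illustration that assumes query heads are assigned to KV heads in contiguous groups; the function name and the 8-to-2 configuration are hypothetical.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the KV head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # query heads per shared KV head
    return q_head // group_size

# Hypothetical configuration: 8 query heads sharing 2 KV heads.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With n_kv_heads equal to n_q_heads the mapping is the identity (standard multi-head attention); with n_kv_heads equal to 1 every query head maps to head 0.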

Question 2

Multi-Head Latent Attention (MLA)

Accepted Answer

An efficient attention variant (used in DeepSeek-V2 and later models) that compresses the key-value cache using low-rank projections, achieving strong performance with much lower memory usage than standard attention.

Question 3

KV Cache

Accepted Answer

A memory buffer that stores previously computed key and value vectors during autoregressive text generation, avoiding redundant computation and dramatically speeding up inference.
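
A rough size estimate helps show why the cache matters. The sketch below assumes one key tensor and one value tensor per layer and 2 bytes per stored value (16-bit); the model dimensions are hypothetical and not taken from any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes for a plain per-layer K/V cache."""
    # Factor of 2: one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical 32-layer model, head_dim 128, 4096 tokens of context.
mha_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(mha_bytes / 1024**3, "GiB with 32 KV heads")  # 2.0
print(gqa_bytes / 1024**3, "GiB with 8 KV heads")   # 0.5
```

The estimate grows linearly with sequence length and batch size, which is why techniques that shrink the number of KV heads or compress the cache target long-context inference.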

Question 4

Mixture of Experts (MoE)

Accepted Answer

An architecture that routes each token to a subset of specialized 'expert' feed-forward networks. This allows models to have a very large total parameter count while keeping the active parameter count (and compute) much smaller.
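
Routing for a single token can be sketched as a top-k selection over router scores. This is an illustrative version that assumes the softmax is taken over the selected experts only; some models normalize over all experts before selecting. The function name and logits are hypothetical.

```python
import math

def route_token(router_logits, top_k):
    """Pick the top_k experts for one token; return (expert_index, weight) pairs."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Softmax over the selected logits only, so the weights sum to 1.
    m = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - m) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Hypothetical router scores for 4 experts; only 2 experts run for this token.
routes = route_token([0.5, 2.0, -1.0, 1.0], top_k=2)
print(routes)
```

The token's output is then the weighted sum of the chosen experts' outputs; the other experts are never evaluated, which is where the compute saving comes from.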

Question 5

Looped Depth Sharing

Accepted Answer

A parameter-sharing architecture that applies the same transformer layer stack more than once. It increases effective computational depth without adding another set of layer weights, although each additional pass still requires computation.

Question 6

Short Convolution (ShortConv)

Accepted Answer

A small causal depthwise 1D convolution that mixes each hidden channel across a fixed number of nearby token positions. With a kernel size of 4, each output can combine the current token with the three preceding positions while keeping a fixed-size state during decoding.
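
A plain-Python sketch of the operation, assuming left zero-padding and one kernel per channel (depthwise). The layout convention here (last kernel tap multiplies the current token) and all names are assumptions for illustration.

```python
def short_conv(x, kernels):
    """Causal depthwise 1D convolution.

    x: list of T token vectors, each of length C.
    kernels: list of C kernels, each of length K; kernels[c][K-1] multiplies
             the current token and kernels[c][0] the oldest position.
    """
    K = len(kernels[0])
    out = []
    for t in range(len(x)):
        row = []
        for c, kernel in enumerate(kernels):
            acc = 0.0
            for j in range(K):
                src = t - (K - 1) + j
                if src >= 0:  # positions before the sequence start count as zero
                    acc += kernel[j] * x[src][c]
            row.append(acc)
        out.append(row)
    return out

x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
# Channel 0 sums the last 4 positions; channel 1 passes the current token through.
y = short_conv(x, kernels=[[1, 1, 1, 1], [0, 0, 0, 1]])
print(y)
```

During decoding only the last K-1 inputs per channel need to be kept, which is the fixed-size state mentioned above.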

Question 7

Rotary Positional Embeddings (RoPE)

Accepted Answer

A positional encoding technique that rotates query and key vectors by angles proportional to token position, so that their dot product depends on the relative offset between the two tokens rather than on their absolute positions. It is commonly paired with long-context scaling methods in modern LLMs.
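
The relative-position property can be checked numerically. The sketch below assumes adjacent dimensions are rotated as pairs and uses a base of 10000; real implementations differ in how they pair dimensions, so treat this as an illustration rather than any particular model's layout.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of dimensions by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same offset (2 positions apart) at two different absolute positions.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 12), rope(k, 10))
print(abs(near - far) < 1e-9)  # True
```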

Question 8

LoRA (Low-Rank Adaptation)

Accepted Answer

A parameter-efficient fine-tuning method that freezes the original model weights and learns small low-rank update matrices instead. It greatly reduces memory and storage requirements for adaptation.
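
The saving is easy to see by counting parameters. For a frozen weight matrix of shape d_out x d_in, the update is the product of a d_out x r matrix and an r x d_in matrix. The dimensions and rank below are hypothetical.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full_matrix_params, lora_update_params) for one weight matrix."""
    full = d_out * d_in            # frozen original weights
    lora = rank * (d_in + d_out)   # trainable factors: (d_out x r) and (r x d_in)
    return full, lora

# Hypothetical 4096 x 4096 projection adapted with rank 8.
full, lora = lora_param_counts(4096, 4096, rank=8)
print(full, lora, full // lora)  # 16777216 65536 256
```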

Question 9

Byte Pair Encoding (BPE)

Accepted Answer

A subword tokenization algorithm that starts with individual characters (or bytes) and iteratively merges the most frequent adjacent pair of symbols into a new token. It is the foundation of most modern LLM tokenizers (GPT, Llama, etc.).
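
The merge loop can be sketched in a few lines. This toy trainer works on a word-frequency dictionary and ignores details such as pre-tokenization, byte fallback, and tie-breaking rules that production tokenizers define; the function name and corpus are hypothetical.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge rules from a {word: frequency} dictionary."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            fused, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    fused.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    fused.append(symbols[i])
                    i += 1
            key = tuple(fused)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges

merges = train_bpe({"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```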

Question 10

Vocabulary Size

Accepted Answer

The total number of unique tokens a tokenizer can produce. Larger vocabularies (e.g. 128k) can represent text more efficiently but increase the size of the embedding and output layers. Most modern LLMs use vocabularies of 32k–200k tokens.

Question 11

Token Embeddings

Accepted Answer

Learned dense vector representations that map each token ID to a continuous vector. The embedding matrix has one row per vocabulary token; looking up these rows is the first step of a transformer LLM, and the learned vectors capture semantic and syntactic relationships.

Question 12

Positional Encoding

Accepted Answer

A mechanism that injects token position information into a transformer, since self-attention on its own has no notion of token order (it is permutation-equivariant). Early models used sinusoidal or learned absolute encodings; most modern LLMs use relative encodings such as RoPE.

Question 13

FlashAttention

Accepted Answer

A memory-efficient exact attention algorithm that uses tiling and kernel fusion to avoid materializing the full attention matrix in GPU high-bandwidth memory (HBM), leading to major speed and memory improvements, especially during training and with long sequences.

Question 14

DPO (Direct Preference Optimization)

Accepted Answer

A simpler and more stable alternative to RLHF for aligning language models with human preferences. It directly optimizes the model using a classification-style loss on preference data.
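
For one preference pair, the loss can be sketched as follows. The inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model; the function name, argument names, and beta value are assumptions for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Negative log-sigmoid of the scaled preference margin for one pair."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The "classification-style" description refers to this form: the loss is a binary logistic loss on which of the two responses is preferred.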

Question 15

GRPO (Group Relative Policy Optimization)

Accepted Answer

A reinforcement learning algorithm used in reasoning model training (notably in DeepSeek-R1) that estimates advantages by comparing groups of responses instead of using a separate value model.
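
The group comparison can be sketched as a normalization of rewards within one group of responses to the same prompt. This assumes the common mean-and-standard-deviation form with a small epsilon for stability; exact details (population versus sample deviation, clipping, extra penalty terms) vary between implementations.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Score each response relative to the other responses in its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of 4 sampled answers: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat the group average get a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.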

Question 16

RLVR (Reinforcement Learning with Verifiable Rewards)

Accepted Answer

A post-training approach where the reward signal comes from automatically checkable answers, such as math results, code tests, or symbolic verifiers, instead of a learned reward model.

Question 17

QK-Norm

Accepted Answer

Normalization applied to the query and key vectors before the attention dot product. It improves training stability, especially in very large models.

Question 18

Sliding Window Attention (SWA)

Accepted Answer

An attention pattern where each token can only attend to a fixed-size local window of previous tokens (plus global tokens in some variants). Used in models like Mistral and Gemma to reduce compute.
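
The attention pattern can be written as a boolean mask. This sketch assumes the window size counts the current token and includes no global tokens; conventions differ between models.

```python
def sliding_window_mask(seq_len, window):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[0 <= q - k < window for k in range(seq_len)]
            for q in range(seq_len)]

mask = sliding_window_mask(seq_len=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so attention cost per token stays constant instead of growing with sequence length.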

Question 19

Causal Attention

Accepted Answer

A self-attention setup where each token can attend only to itself and earlier tokens. This masking is what lets decoder-only LLMs generate text autoregressively.
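
A minimal sketch of the masking step, operating on a square matrix of raw attention scores (rows are query positions). The function name is hypothetical, and real implementations usually add a large negative value to masked scores before a batched softmax rather than slicing.

```python
import math

def causal_attention_weights(scores):
    """Softmax each row over positions up to and including the query position."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]  # future positions are excluded
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights

# With equal scores, each token spreads attention evenly over itself and the past.
weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
print(weights)
```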

Question 20

Multi-Query Attention (MQA)

Accepted Answer

An attention variant where all query heads share a single key head and a single value head. It reduces KV cache memory more aggressively than GQA, often with a larger quality tradeoff.

Question 21

Hybrid Attention

Accepted Answer

A broad architecture pattern that mixes full attention with cheaper sequence modules such as linear attention or state-space layers, aiming to reduce long-context cost while preserving retrieval ability.

Question 22

Cross-Layer KV Sharing

Accepted Answer

An efficiency technique where some layers reuse key and value tensors from earlier layers, reducing how many separate KV caches grow with context length.

Question 23

Multi-Head Attention (MHA)

Accepted Answer

The original transformer attention mechanism in which the model computes multiple attention 'heads' in parallel, each with its own learned projections for queries, keys, and values.

Question 24

Root Mean Square Layer Normalization (RMSNorm)

Accepted Answer

A simplified and more computationally efficient variant of LayerNorm that rescales inputs by their root mean square only, skipping the mean subtraction that LayerNorm performs. Widely adopted in Llama, Mistral, and many modern models.
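
For a single vector, the computation can be sketched as below. The learned per-channel weight and the epsilon value are the usual ingredients, but the exact epsilon and its placement are assumptions here.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide by the root mean square of x, then apply a per-channel scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

y = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(y)
```

Unlike LayerNorm, no mean is subtracted, so the output keeps the sign pattern and direction of the input and only its scale changes.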

Question 25

SwiGLU

Accepted Answer

A gated feed-forward block built around a SiLU or Swish-style gate. It replaced older plain GELU MLP blocks in many modern LLM architectures.
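
A small plain-Python sketch of the block, assuming three bias-free weight matrices (gate, up, and down projections) stored as lists of rows. The names and tiny dimensions are hypothetical.

```python
import math

def silu(v):
    """SiLU (Swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """down( silu(gate(x)) * up(x) ), with elementwise multiplication."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

out = swiglu_ffn(
    x=[1.0, 2.0],
    w_gate=[[0.0, 0.0], [1.0, 0.0]],  # gate pre-activations: [0, 1]
    w_up=[[1.0, 1.0], [1.0, 1.0]],    # up projection: [3, 3]
    w_down=[[1.0, 1.0]],
)
print(out)
```

The first hidden unit is fully closed by its gate (SiLU of 0 is 0), so only the second unit contributes to the output.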

Question 26

No Positional Embeddings (NoPE)

Accepted Answer

A design choice where selected attention layers omit explicit positional embeddings. Recent models usually use it selectively rather than removing positional information everywhere.

Question 27

Pretraining

Accepted Answer

The initial large-scale training phase where a model learns general language patterns by predicting the next token on massive unlabeled text corpora.

Question 28

Instruction Finetuning (SFT)

Accepted Answer

The supervised fine-tuning stage after pretraining where the model is trained on instruction-response pairs to follow user instructions and produce helpful outputs.

Question 29

RLHF (Reinforcement Learning from Human Feedback)

Accepted Answer

A post-training technique that uses human preference data and reinforcement learning (usually with a reward model) to align model outputs with human values and preferences.

Question 30

Reasoning Model

Accepted Answer

A large language model trained or prompted to spend more inference compute on multi-step problems, often using reinforcement learning and inference-time scaling techniques.

Question 31

Inference-Time Scaling

Accepted Answer

A family of methods that spend more computation during generation, rather than only during training, to improve reasoning or answer quality.

Question 32

LLM-as-a-Judge

Accepted Answer

An evaluation setup where another language model scores, compares, or critiques generated answers, especially when exact string matching is too brittle.

Question 33

Verifier

Accepted Answer

A program, rule, test suite, or scoring function that checks whether an answer is correct. Verifiers are central to many RLVR and reasoning-evaluation workflows.

Question 34

Perplexity

Accepted Answer

The exponential of average cross-entropy loss on a dataset. It measures next-token prediction quality, not general helpfulness or alignment.
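
Given per-token log-probabilities (natural log) that a model assigned to the correct next tokens, the calculation is short. The function name and example values are illustrative.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every correct token.
ppl = perplexity([math.log(0.25)] * 4)
print(ppl)
```

A perplexity of about 4 here means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens.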

Question 35

Benchmark

Accepted Answer

A standardized dataset and scoring protocol used to compare models. Common examples include MMLU (general knowledge), HumanEval (code), and MATH (competition mathematics). Scores are only meaningful relative to a fixed prompt format and evaluation setup.

Question 36

Training Data Contamination

Accepted Answer

The presence of evaluation benchmark examples in a model's pretraining data, which can inflate benchmark scores beyond what the model would achieve on truly unseen problems.

Question 37

Few-Shot Prompting

Accepted Answer

Providing a model with a small number of labeled examples inside the prompt before asking it to solve a new instance. Zero-shot uses no examples; few-shot typically uses 1–32. Both are common in benchmark evaluation protocols.

Question 38

Context Length

Accepted Answer

The maximum number of tokens a model can condition on at once. Larger contexts can handle longer documents but increase memory and compute costs.

Question 39

Prompt Template

Accepted Answer

The fixed formatting pattern around user instructions, system messages, and responses. It matters because finetuning teaches the model to follow a specific token pattern.

Question 40

Base Model

Accepted Answer

A model after pretraining but before instruction tuning or alignment. It is good at next-token prediction but not necessarily good at following user instructions.

Question 41

Instruct Model

Accepted Answer

A model tuned on instruction-response examples so it follows prompts, formats answers more usefully, and behaves more like an assistant.

Question 42

Per-Layer Embeddings (PLE)

Accepted Answer

Small layer-specific token vectors added inside the model so edge-scale architectures can gain embedding capacity without widening the full transformer path.

Question 43

Layer-Wise Attention Budgeting

Accepted Answer

An architecture choice that varies attention capacity across layer types, such as assigning different query-head counts to global and sliding-window layers.

Question 44

DeepSeek Sparse Attention

Accepted Answer

A sparse attention mechanism that selects a subset of previous tokens to reduce long-context attention cost in DeepSeek V3.2- and GLM-5-style models.

Question 45

IndexShare

Accepted Answer

An efficiency mechanism for DeepSeek Sparse Attention that computes top-k token indices in one layer and reuses them across nearby layers. In GLM-5.2, each indexer result serves four transformer layers, reducing repeated long-context indexer work.

Question 46

Compressed Sparse Attention / Heavily Compressed Attention

Accepted Answer

Long-context attention variants that compress older context along the sequence dimension so the cache or attention map has fewer effective entries.

Question 47

Gated Attention

Accepted Answer

An attention block that keeps content-based attention but adds gating and stabilizing tweaks so the model can modulate attention output more flexibly.

Question 48

Latent MoE

Accepted Answer

A mixture-of-experts design that performs expert routing in a lower-dimensional latent space before projecting back to the model width.

Question 49

Temperature

Accepted Answer

A decoding control that divides the next-token logits by a constant before the softmax, which sharpens or flattens the resulting probability distribution. Lower values make outputs more deterministic, while higher values increase randomness.
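
A minimal sketch of the effect on a small set of hypothetical logits:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=2.0)
print(cold)  # more mass on the top token
print(hot)   # flatter distribution
```

A temperature of exactly 0 is usually handled as a special case (greedy decoding) rather than by division.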

Question 50

Top-k Sampling

Accepted Answer

A decoding method that samples only from the k most likely next tokens, discarding the rest of the probability distribution.
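
The filtering step can be sketched as below, assuming the kept probabilities are renormalized before sampling. The function name and probabilities are hypothetical.

```python
def top_k_filter(probs, k):
    """Keep the k most likely token indices and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

filtered_k = top_k_filter([0.5, 0.3, 0.15, 0.05], k=2)
print(filtered_k)
```

Sampling then draws from the returned dictionary, for example with `random.choices(list(filtered_k), weights=list(filtered_k.values()))`.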

Question 51

Top-p Sampling

Accepted Answer

A decoding method that samples from the smallest set of likely next tokens whose cumulative probability reaches a chosen threshold p.
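
A sketch of the selection step, again assuming renormalization of the kept probabilities. The function name and probabilities are hypothetical.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

filtered_p = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)
print(sorted(filtered_p))  # [0, 1, 2]
```

Unlike top-k, the number of kept tokens adapts to the distribution: a confident model keeps few candidates, an uncertain one keeps many.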

Question 52

Distillation

Accepted Answer

A training approach where a smaller or cheaper model learns from outputs produced by a stronger teacher model, often to transfer reasoning behavior or style.

Question 53

Tool Use

Accepted Answer

A model behavior where the LLM calls an external tool, uses the returned result, and incorporates that result into its response.

Question 54

Mixed Precision

Accepted Answer

A training or inference strategy that uses lower-precision number formats where possible to reduce memory use and improve hardware throughput.

Question 55

bfloat16

Accepted Answer

A 16-bit floating-point format that keeps the same 8-bit exponent as float32 but only 7 mantissa bits, trading precision for range. It is often easier to use for LLM training than float16 because it is less prone to overflow.
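
The format can be illustrated by keeping only the top 16 bits of a float32 encoding. This sketch truncates rather than rounding to nearest, which real hardware and libraries typically do, so it is an approximation for illustration only.

```python
import struct

def to_bfloat16(x):
    """Truncate a float to bfloat16 precision via its float32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 is the sign bit, 8 exponent bits, and top 7 mantissa bits of float32.
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_bfloat16(70000.0))  # 69632.0: representable, while float16 tops out at 65504
print(to_bfloat16(1.001))    # 1.0: small differences are lost to the short mantissa
```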

Question 56

torch.compile

Accepted Answer

A PyTorch feature that can speed up repeated workloads by reducing Python overhead and enabling graph-level kernel optimizations.
