Machine Learning FAQ
How do architectures such as GPT, Llama, Qwen, and Gemma differ at a high level?
At a high level, GPT, Llama, Qwen, and Gemma all belong to the same broad family: they are transformer-based autoregressive language models. But they differ in several design choices that affect efficiency, scaling behavior, and deployment tradeoffs.
The easiest way to think about them is:
- GPT represents the earlier, simpler decoder-only transformer pattern.
- Llama is a modernized GPT-like family with more efficient building blocks.
- Qwen is another modern family that extends the Llama-style pattern with additional choices such as GQA, qk normalization, and MoE variants.
- Gemma is also a modern family, but with some distinctive choices around attention patterns and normalization layout.

Here are the main differences in broad terms.
GPT-style models
- typically use learned absolute positional embeddings
- use LayerNorm
- use standard multi-head attention
- use a simpler feed-forward setup such as GELU MLP blocks
This is the clean educational baseline used in the early chapters of the repo.
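As a concrete reference point, here is a minimal sketch of such a GPT-style block in PyTorch. This is not the repo's exact code: the dimensions (768-dim embeddings, 12 heads, a 50257-token vocabulary) are illustrative GPT-2-like defaults, and `nn.MultiheadAttention` stands in for a hand-written attention implementation.

```python
import torch
import torch.nn as nn

class GPTStyleBlock(nn.Module):
    """Minimal GPT-style decoder block: LayerNorm + standard multi-head attention + GELU MLP."""
    def __init__(self, emb_dim=768, n_heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)   # LayerNorm: mean/variance normalization with scale and bias
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.mlp = nn.Sequential(            # GPT-style feed-forward: 4x expansion with GELU
            nn.Linear(emb_dim, 4 * emb_dim),
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )

    def forward(self, x):
        seq_len = x.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)   # standard (non-grouped) MHA
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x

# Learned absolute positional embeddings: one trainable vector per position index.
vocab_size, ctx_len, emb_dim = 50257, 1024, 768
tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(ctx_len, emb_dim)

token_ids = torch.randint(0, vocab_size, (1, 8))
x = tok_emb(token_ids) + pos_emb(torch.arange(8))
print(GPTStyleBlock()(x).shape)   # torch.Size([1, 8, 768])
```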
Llama-family models
- replace LayerNorm with RMSNorm
- replace absolute positional embeddings with RoPE
- use bias-free linear layers
- use SwiGLU feed-forward blocks instead of the older GPT-style GELU block
- later variants use GQA for more efficient inference
These changes are meant to preserve strong performance while improving efficiency and scaling.
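The two most distinctive Llama-style building blocks, RMSNorm and the SwiGLU feed-forward, can be sketched in a few lines of PyTorch. This is a simplified illustration rather than any particular Llama release; the 768/2048 dimensions are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: rescales by the root-mean-square only; no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

class SwiGLUFeedForward(nn.Module):
    """SwiGLU block: a SiLU-gated linear unit instead of the GPT-style GELU MLP.
    All linear layers are bias-free, as in Llama-style models."""
    def __init__(self, emb_dim, hidden_dim):
        super().__init__()
        self.gate_proj = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, emb_dim, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

x = torch.randn(1, 8, 768)
x = RMSNorm(768)(x)
print(SwiGLUFeedForward(768, 2048)(x).shape)   # torch.Size([1, 8, 768])
```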
Qwen
- follows a modern RoPE + RMSNorm style architecture
- uses GQA
- includes additional details such as qk normalization (a combined GQA + qk-norm sketch follows this list)
- comes in both dense and MoE variants in the repo
- often uses large vocabularies and product variants such as base, instruct, coder, and reasoning models
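The GQA and qk normalization mentioned above can be combined in one small sketch. The head counts and dimensions below are illustrative rather than Qwen's actual configuration, and `nn.RMSNorm` requires PyTorch 2.4 or newer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAWithQKNorm(nn.Module):
    """Sketch of grouped-query attention with qk normalization.
    n_kv_heads < n_heads, so several query heads share one key/value head."""
    def __init__(self, emb_dim=768, n_heads=12, n_kv_heads=4):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = emb_dim // n_heads
        self.q_proj = nn.Linear(emb_dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(emb_dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(emb_dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, emb_dim, bias=False)
        # qk normalization: RMSNorm over the head dimension of queries and keys
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)
        # Expand the smaller set of kv heads so each group of query heads shares them.
        repeat = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(1, 8, 768)
print(GQAWithQKNorm()(x).shape)   # torch.Size([1, 8, 768])
```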

Gemma
- is also in the modern RoPE + RMSNorm family
- uses GQA
- places RMSNorm both before and after the attention and feed-forward sublayers, a different normalization layout from the pre-norm-only GPT/Llama baselines
- in Gemma 3, emphasizes efficiency-oriented attention design such as sliding-window attention mixed with global attention
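A rough way to picture the local/global mix is as a per-layer choice of attention mask. The window size and the layer pattern below are illustrative, not Gemma's actual hyperparameters.

```python
import torch

def attention_mask(seq_len, use_sliding_window, window=4):
    """Sketch of choosing between a global causal mask and a sliding-window
    causal mask (True = this key position may be attended to)."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i                          # global causal attention
    if use_sliding_window:
        return causal & (i - j < window)     # only look back `window` tokens
    return causal

# In a Gemma-3-style stack, most layers would use the local (sliding-window) mask
# and only every few layers would use the global mask; the 2:1 ratio here is made up.
for layer_idx in range(6):
    local = (layer_idx % 3 != 0)
    print(layer_idx, "local" if local else "global")
print(attention_mask(6, use_sliding_window=True, window=3).int())
```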

So the most important point is not that these models come from completely different worlds. They are all descendants of the transformer decoder idea. The real differences are in architectural refinements such as:
- absolute versus rotary positional encoding (RoPE; a minimal sketch follows this list)
- LayerNorm versus RMSNorm
- standard MHA versus GQA
- GELU feed-forward blocks versus SwiGLU-style blocks
- full attention versus hybrids that include local or sliding-window attention
- dense feed-forward layers versus sparse Mixture-of-Experts (MoE) layers
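Since RoPE comes up for all three modern families, here is a minimal sketch of how rotary embeddings rotate query/key channels by position-dependent angles. Unlike learned absolute embeddings, RoPE is applied to queries and keys inside the attention computation rather than added to the token embeddings; the interleaved-channel pairing below is one of several conventions used in practice.

```python
import torch

def apply_rope(x, base=10000.0):
    """Minimal rotary positional embedding (RoPE) sketch.
    x: (batch, n_heads, seq_len, head_dim); head_dim must be even.
    Each channel pair is rotated by a position-dependent angle, so relative
    offsets between tokens show up directly in the q/k dot products."""
    b, h, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))   # (d/2,)
    angles = torch.arange(t).float()[:, None] * inv_freq[None, :]    # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                              # split channel pairs
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1).flatten(-2)
    return rotated

q = torch.randn(1, 12, 8, 64)
print(apply_rope(q).shape)   # torch.Size([1, 12, 8, 64])
```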
That is why modern architecture comparisons often feel incremental rather than revolutionary. Most of the time, the models share the same high-level blueprint and differ in how they optimize the details for scale, efficiency, and deployment.
In short, GPT is the simpler, older decoder-only baseline, while Llama, Qwen, and Gemma are more modern transformer families that keep the same overall idea but make different choices about normalization, positional encoding, attention efficiency, and feed-forward design.