Mixture of Experts (MoE) is one of the main reasons recent open-weight models can reach very large total parameter counts without a matching increase in the cost of each inference step.

The basic idea is to replace a single dense FeedForward block with multiple expert FeedForward blocks and then use a router so that only a small subset is active for each token.

Figure: Mixture-of-Experts module in DeepSeek V3 and R1 compared to a standard feed-forward block. The main structural change is straightforward: a single dense FeedForward block is replaced by a routed MoE module with multiple experts (Original source: The Big LLM Architecture Comparison).

What changes: One dense feed-forward path becomes several expert feed-forward paths plus a router.

Practical benefit: The model can have much higher total capacity while keeping only a smaller active path per token.
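To make the routed-expert idea concrete, here is a minimal sketch of a sparse MoE layer, assuming PyTorch. The Expert and SparseMoE class names, the widths, and the top_k value are illustrative placeholders, not the layout of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A standard feed-forward block; each expert is just one of these."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)


class SparseMoE(nn.Module):
    """A router scores all experts, but only the top_k run for each token."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [Expert(d_model, d_hidden) for _ in range(num_experts)]
        )
        self.router = nn.Linear(d_model, num_experts)  # one score per expert
        self.top_k = top_k

    def forward(self, x):
        # x: (num_tokens, d_model), already flattened over batch and sequence
        scores = self.router(x)                                  # (tokens, experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)  # chosen experts per token
        weights = F.softmax(topk_scores, dim=-1)                 # normalize over the chosen experts

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Only the tokens routed to expert e pass through it
                    out[mask] += w[mask] * expert(x[mask])
        return out
```

A call like `SparseMoE(d_model=64, d_hidden=256, num_experts=8, top_k=2)(torch.randn(10, 64))` runs ten tokens through the layer, with each token touching only two of the eight experts.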

Why It Matters

The FeedForward block already accounts for a large share of the parameters inside a transformer layer. So when we replace one FeedForward block with many expert blocks, the model’s total parameter count can increase substantially.

The key point is that the router does not activate all experts for every token. It selects only a small subset. That is why an MoE model can be very large in total capacity while still using a much smaller number of active parameters on each inference step.

Figure: Difference between total and active parameters in Mixture-of-Experts layers. As the number of experts increases, total parameters rise much faster than the active parameters per token (Original source: LLMs-from-scratch MoE materials).
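A quick back-of-the-envelope calculation shows how the two counts diverge. The widths, expert count, and top_k below are made-up round numbers for illustration, not the configuration of any released model.

```python
# Compare total vs. active MoE parameters for one layer (illustrative numbers).
d_model, d_hidden = 4096, 14336      # feed-forward widths (assumed, not from a real model)
num_experts, top_k = 64, 2           # experts per layer, experts used per token

ffn_params = 2 * d_model * d_hidden  # up- and down-projection, ignoring biases

total_moe_params = num_experts * ffn_params   # stored in the model
active_moe_params = top_k * ffn_params        # actually used per token

print(f"dense FFN:        {ffn_params / 1e6:.0f}M parameters")
print(f"MoE total:        {total_moe_params / 1e9:.1f}B parameters")
print(f"MoE active/token: {active_moe_params / 1e6:.0f}M parameters")
```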

What “Sparse” Means Here

MoE layers are often described as sparse because not all experts are used for every token. The model is large, but the computation per token is selective.

This is why MoE model cards often list both total parameters and active parameters. DeepSeek V3 is the standard example: roughly 671B total parameters, but only about 37B are active for any given token.

Shared Experts And Variants

Once the basic MoE idea became common, teams started to vary the details. One example is the shared expert, which is always active in addition to the routed experts. Another is latent MoE, where the expert computation is moved into a smaller latent space, as in Nemotron 3 Super.

So while many models are called MoE, they can still differ quite a bit in expert count, routed experts per token, whether they use shared experts, and how large the expert subnetwork is.
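Continuing the earlier sketch, a shared expert can be wired in as one expert that every token passes through, with its output added to the routed output. This is only one plausible arrangement; real implementations differ in how they weight and normalize the two paths.

```python
import torch.nn as nn


class SharedExpertMoE(nn.Module):
    """Shared-expert variant, reusing Expert and SparseMoE from the sketch above."""

    def __init__(self, d_model, d_hidden, num_routed_experts, top_k):
        super().__init__()
        self.shared_expert = Expert(d_model, d_hidden)          # always active
        self.routed = SparseMoE(d_model, d_hidden, num_routed_experts, top_k)

    def forward(self, x):
        # Every token goes through the shared expert; only top_k routed
        # experts run per token on top of that.
        return self.shared_expert(x) + self.routed(x)
```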

Example Architectures

The models named above serve as concrete examples: DeepSeek V3 and R1 for the routed MoE module, and Nemotron 3 Super for the latent-MoE variant.

Sources

The Big LLM Architecture Comparison (Sebastian Raschka)
LLMs-from-scratch MoE materials (Sebastian Raschka)