Latent MoE is still a sparse MoE; the difference is that the expert computation no longer happens at the model's full hidden width.

The architecture that introduced it is Nemotron 3 Super. Here, instead of running the experts directly at width 4096, the model first projects down to 1024, applies the routed experts there, and then projects back up. The routing idea stays the same, but the expert path itself becomes cheaper (minus the down- and up-projection overhead).

Nemotron 3 Super architecture showing latent MoE layers
Nemotron 3 Super is currently the only gallery example because this latent-space expert path was only introduced in March 2026 (Original source: The Big LLM Architecture Comparison).
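
To get a feel for the trade-off, here is a back-of-envelope parameter count in Python. The 4096 and 1024 widths come from the description above; the bias-free two-matrix FFN shape and the 4x expansion factor are illustrative assumptions, not Nemotron 3 Super internals.

```python
def ffn_params(d: int, expansion: int = 4) -> int:
    """Parameters of a simple two-matrix FFN: d -> expansion*d -> d (no biases)."""
    return 2 * d * (expansion * d)

d_hidden, d_latent = 4096, 1024

full_expert = ffn_params(d_hidden)     # one expert run at full hidden width
latent_expert = ffn_params(d_latent)   # the same expert shape at latent width
projections = 2 * d_hidden * d_latent  # shared down- and up-projection matrices

print(f"full-width expert:   {full_expert:,}")    # 134,217,728
print(f"latent expert:       {latent_expert:,}")  # 8,388,608
print(f"down/up projections: {projections:,}")    # 8,388,608
```

Under these assumptions each latent expert is 16x smaller than its full-width counterpart, and the projection overhead is shared across all the experts in a layer rather than paid once per expert.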

What changes

The expert path moves into a compressed latent space instead of running at full hidden width

Practical benefit

Lower expert-side compute while keeping the basic sparse-routing setup

Example architecture

Nemotron 3 Super 120B-A12B

Start With Regular MoE

Standard MoE already gives us the large gap between total parameters and active parameters per token. The from-scratch MoE materials are useful here because they make that baseline concrete: the whole point is that the model can be huge in total capacity while activating only a small routed subset on each step.

Difference between total and active parameters in MoE
Latent MoE starts from this same sparse-routing logic, then adds another efficiency step inside the expert path itself (Original source: LLMs-from-scratch MoE materials).
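
In code, the baseline looks roughly like the sketch below: a minimal top-k routed MoE layer in PyTorch. The `SparseMoE` name, the GELU expert FFNs, and all of the sizes are small illustrative choices, not taken from any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal top-k routed MoE: every token runs only top_k of n_experts."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):  # only the selected experts ever run
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = SparseMoE(d_model=64, d_ff=256, n_experts=8, top_k=2)
total = sum(p.numel() for p in moe.parameters())
active = moe.router.weight.numel() + moe.top_k * sum(
    p.numel() for p in moe.experts[0].parameters()
)
print(f"total params: {total:,}  active per token: {active:,}")
```

The last two lines print the gap directly: all eight expert FFNs count toward total parameters, but each token only touches the router plus two of them.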

The Nemotron 3 Super LatentMoE

In LatentMoE, the inputs are down-projected from 4096 to 1024, the experts operate there, and the outputs are then up-projected back to 4096.

That means the model is not only selecting a sparse subset of experts. It is also making each selected expert cheaper to run. This makes latent MoE feel less like a separate MoE family and more like an additional efficiency tweak on top of standard sparse routing.
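
That framing also suggests a sketch. Continuing with the `SparseMoE` class from the snippet above, a latent variant just wraps it in shared down- and up-projections. The 4096 -> 1024 -> 4096 shape comes from the text; everything else here, including routing at the latent width rather than the full width, is an assumption of this sketch, not a description of Nemotron's internals.

```python
class LatentMoE(nn.Module):
    """Shared projections around a standard sparse MoE living at latent width."""

    def __init__(self, d_model=4096, d_latent=1024, d_ff=4096, n_experts=8, top_k=2):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)    # 4096 -> 1024
        self.moe = SparseMoE(d_latent, d_ff, n_experts, top_k)  # experts at width 1024
        self.up = nn.Linear(d_latent, d_model, bias=False)      # 1024 -> 4096

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, 4096)
        return self.up(self.moe(self.down(x)))

x = torch.randn(4, 4096)
print(LatentMoE()(x).shape)  # torch.Size([4, 4096])
```

Nothing about the routing changes; the entire difference is the pair of shared projections around the expert block.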
