NVIDIA’s Nemotron 3 Super 120B-A12B is a notable open-weight release because its design explicitly targets the accuracy-throughput trade-off.

Based on the technical report, model config, and my local architecture card, the model (120B total parameters, 12B active per token) combines several efficiency choices:

  1. Mamba-2 layers for throughput and long-context efficiency
  2. Latent MoE layers for sparse scaling at lower inference cost
  3. Shared-weight multi-token prediction for native speculative decoding
  4. A small number of GQA layers mixed into the hybrid stack
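To make two of these choices concrete, here is a rough back-of-the-envelope sketch. It computes the share of weights that are active per token, and the expected number of tokens emitted per verification pass under speculative decoding. The independence assumption, the draft length, and the acceptance probability are illustrative assumptions of mine, not values from the technical report, and the function names are hypothetical.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's weights used per token in a sparse MoE model."""
    if total_params <= 0 or not 0 <= active_params <= total_params:
        raise ValueError("need 0 <= active_params <= total_params, total_params > 0")
    return active_params / total_params


def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass in speculative decoding.

    Simplifying assumption: each drafted token is accepted independently
    with probability accept_prob, drafting stops at the first rejection,
    and the verifier always contributes one token of its own. Under that
    assumption the expectation is the geometric sum
    1 + p + p^2 + ... + p^draft_len.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_prob == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


if __name__ == "__main__":
    # 120B total, 12B active, as in the model name.
    print(f"active fraction: {active_fraction(120e9, 12e9):.0%}")
    # Hypothetical acceptance rates, for illustration only.
    for p in (0.5, 0.7, 0.9):
        print(f"p={p}: {expected_tokens_per_step(p, draft_len=3):.2f} tokens/step")
```

The point of the sketch is only that both levers multiply: fewer active weights per token lowers the cost of each forward pass, and accepted draft tokens reduce the number of passes needed. Real speedups depend on hardware, batch size, and how well the draft head matches the main model.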

As of March 2026, the reported benchmark profile looks roughly competitive with GPT-OSS 120B and Qwen3.5 models of similar active scale, while the throughput numbers look stronger. I would treat the exact benchmark ordering as date-sensitive, but the architectural point is more durable: Nemotron 3 Super puts a lot of design effort into reducing latency and cost rather than only pushing raw scores.

That makes it a relevant model to watch for local agentic applications, where throughput and cost often matter as much as peak benchmark numbers.

Figure from the original Substack note, showing the Nemotron 3 Super architecture and selected benchmark and throughput comparisons.

Source: lightly edited website version of my Substack note.