Nemotron 3 Super Throughput Notes
NVIDIA’s Nemotron 3 Super 120B-A12B is a nice open-weight release because the design explicitly targets the accuracy-throughput trade-off.
Based on the technical report, model config, and my local architecture card, the model (120B total parameters, 12B active per token) combines several efficiency choices:
- Mamba-2 layers for throughput and long-context efficiency
- Latent MoE layers for sparse scaling at lower inference cost
- Shared-weight multi-token prediction for native speculative decoding
- A small number of GQA layers mixed into the hybrid stack
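To make the speculative decoding bullet a bit more concrete, here is a minimal sketch of the standard expected-tokens-per-step estimate. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and the acceptance rates and draft lengths below are hypothetical placeholders, not numbers from the Nemotron report.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Simplifying assumption: every drafted token is accepted independently
    with probability `acceptance_rate`, and the verifier always contributes
    one token of its own. Under that assumption the expectation is the
    geometric sum 1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus the verifier's own token.
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Hypothetical acceptance rates and draft lengths, for illustration only.
    for rate in (0.6, 0.8, 0.9):
        for k in (1, 2, 4):
            tokens = expected_tokens_per_step(rate, k)
            print(f"acceptance={rate:.1f} draft_len={k} -> {tokens:.2f} tokens/step")
```

The point of the estimate is that gains flatten quickly once the acceptance rate drops, which is one reason draft quality matters for any speculative decoding setup.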
As of March 2026, the reported benchmark profile looks roughly competitive with GPT-OSS 120B and Qwen3.5 models of similar active scale, while the throughput numbers look stronger. I would treat the exact benchmark ordering as date-sensitive, but the architecture point is more durable. Nemotron 3 Super spends a lot of design effort on reducing latency and cost rather than only pushing raw score.
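The total-versus-active split behind the cost argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below uses only the 120B and 12B figures from this note plus the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass. That rule is an approximation I am assuming here, not a figure from the report, and it ignores attention-over-context and state-space scan costs.

```python
TOTAL_PARAMS = 120e9   # total parameters, from the model name (120B)
ACTIVE_PARAMS = 12e9   # parameters active per token (A12B)


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the weights that participate in processing one token."""
    return active_params / total_params


def approx_forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward cost: about 2 FLOPs per active parameter.

    This is an assumed approximation (one multiply and one add per weight),
    not a measured number for any specific model.
    """
    return 2.0 * active_params


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    sparse = approx_forward_flops_per_token(ACTIVE_PARAMS)
    dense = approx_forward_flops_per_token(TOTAL_PARAMS)
    print(f"active fraction: {frac:.0%}")
    print(f"approx forward FLOPs/token (sparse): {sparse:.2e}")
    print(f"approx forward FLOPs/token (dense equivalent): {dense:.2e}")
```

Note that memory footprint still scales with the total parameter count, since all experts have to be resident. The saving is in per-token compute, not in weight storage.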
That makes it a relevant model to watch for local agentic applications, where throughput and cost often matter as much as peak benchmark numbers.
Source: lightly edited website version of my Substack note.
