sebastianraschka.com/llm-architecture-gallery/
Percent Active Parameters per Token
| # | Model | Active % | Active params | Total params | Type | Release date | Attention |
|---|---|---|---|---|---|---|---|
| 1 | DeepSeek V4-Pro | 3.1% | 49B active | 1.6T | MoE | 2026-04-24 | CSA/HCA |
| 2 | Kimi K2 | 3.2% | 32B active | 1T | MoE | 2025-07-10 | MLA |
| 3 | Kimi K2.5 | 3.2% | 32B active | 1T | MoE | 2026-01-27 | MLA |
| 4 | Kimi K2.6 | 3.2% | 32B active | 1T | MoE | 2026-04-20 | MLA |
| 5 | Arcee AI Trinity Large 400B | 3.3% | 13B active | 400B | MoE | 2026-01-27 | 3:1 sliding-window/global gated GQA |
| 6 | Qwen3 Next 80B-A3B | 3.8% | 3B active | 80B | Hybrid | 2025-09-09 | 3:1 Gated DeltaNet and Gated Attention |
| 7 | Llama 4 Maverick | 4.3% | 17B active | 400B | MoE | 2025-04-05 | GQA |
| 8 | MiniMax M2 230B | 4.3% | 10B active | 230B | MoE | 2025-10-23 | GQA |
| 9 | MiniMax M2.5 230B | 4.3% | 10B active | 230B | MoE | 2026-02-12 | GQA |
| 10 | Qwen3.5 397B | 4.3% | 17B active | 397B | Hybrid | 2026-02-16 | 3:1 Gated DeltaNet and Gated Attention |
| 11 | MiniMax M2.7 230B | 4.3% | 10B active | 230B | MoE | 2026-03-18 | GQA |
| 12 | GPT-OSS 120B | 4.4% | 5.1B active | 117B | MoE | 2025-08-04 | Alternating sliding-window/global GQA |
| 13 | LongCat-Flash-Lite 68.5B-A3B | 4.4% | 3B active | 68.5B | MoE | 2026-01-28 | MLA |
| 14 | DeepSeek V4-Flash | 4.6% | 13B active | 284B | MoE | 2026-04-24 | CSA/HCA |
| 15 | Xiaomi MiMo-V2.5 310B | 4.8% | 15B active | 310B | MoE | 2026-04-22 | 5:1 sliding-window/global attention |
| 16 | Xiaomi MiMo-V2-Flash 309B | 4.9% | 15B active | 309B | MoE | 2025-12-16 | 5:1 sliding-window/global attention |
| 17 | GLM-5 744B | 5.4% | 40B active | 744B | MoE | 2026-02-11 | MLA with DeepSeek Sparse Attention |
| 18 | GLM-5.1 | 5.4% | 40B active | 744B | MoE | 2026-04-07 | MLA with DeepSeek Sparse Attention |
| 19 | DeepSeek V3 | 5.5% | 37B active | 671B | MoE | 2024-12-26 | MLA |
| 20 | DeepSeek R1 | 5.5% | 37B active | 671B | MoE | 2025-01-20 | MLA |
| 21 | DeepSeek V3.2 | 5.5% | 37B active | 671B | MoE | 2025-12-01 | MLA with DeepSeek Sparse Attention |
| 22 | Step 3.5 Flash 196B | 5.6% | 11B active | 196B | MoE | 2026-02-01 | 3:1 sliding-window GQA |
| 23 | Mistral Small 4 | 5.6% | 6.63B active | 119B | MoE | 2026-03-16 | MLA |
| 24 | Mistral Large 3 | 6.1% | 41B active | 673B | MoE | 2025-12-02 | MLA |
| 25 | Kimi Linear 48B-A3B | 6.3% | 3B active | 48B | Hybrid | 2025-10-30 | 3:1 Kimi Delta Attention and MLA |
| 26 | Ling 2.5 1T | 6.3% | 63B active | 1T | Hybrid | 2026-02-15 | Lightning Attention plus MLA |
| 27 | Ling 2.6 1T | 6.3% | 63B active | 1T | Hybrid | 2026-04-23 | Lightning Attention plus MLA |
| 28 | Tencent Hy3-preview 295B-A21B | 7.1% | 21B active | 295B | MoE | 2026-04-23 | GQA |
| 29 | Sarvam 30B | 8.0% | 2.4B active | 30B | MoE | 2026-03-03 | GQA |
| 30 | Qwen3.6 35B-A3B | 8.6% | 3B active | 35B | Hybrid | 2026-04-15 | 3:1 Gated DeltaNet and Gated Attention |
| 31 | GLM-4.5 355B | 9.0% | 32B active | 355B | MoE | 2025-07-28 | GQA |
| 32 | GLM-4.7 355B | 9.0% | 32B active | 355B | MoE | 2025-12-22 | GQA |
| 33 | ZAYA1-8B | 9.0% | 760M active | 8.4B | MoE | 2026-05-06 | CCA with 4:1 GQA |
| 34 | Laguna XS.2 | 9.1% | 3B active | 33B | MoE | 2026-04-28 | 3:1 sliding-window/global gated GQA |
| 35 | Qwen3 235B-A22B | 9.4% | 22B active | 235B | MoE | 2025-04-28 | GQA |
| 36 | Sarvam 105B | 9.8% | 10.3B active | 105B | MoE | 2026-03-03 | MLA |
| 37 | Qwen3 30B-A3B | 10.0% | 3B active | 30B | MoE | 2025-04-28 | GQA |
| 38 | Nemotron 3 Nano 30B-A3B | 10.0% | 3B active | 30B | Hybrid MoE | 2025-12-04 | Mamba-2 + GQA |
| 39 | Nemotron 3 Super 120B-A12B | 10.0% | 12B active | 120B | Hybrid MoE | 2026-03-11 | Mamba-2 + GQA |
| 40 | Qwen3 Coder Flash 30B-A3B | 11.0% | 3.3B active | 30B | MoE | 2025-07-31 | GQA |
| 41 | GLM-4.5-Air | 11.3% | 12B active | 106B | MoE | 2025-07-28 | GQA |
| 42 | INTELLECT-3 | 11.3% | 12B active | 106B | MoE | 2025-11-26 | GQA |
| 43 | Gemma 4 26B-A4B | 15.1% | 3.8B active | 25.2B | MoE | 2026-04-02 | 5:1 sliding-window/global GQA |
| 44 | GPT-OSS 20B | 17.1% | 3.6B active | 21B | MoE | 2025-08-04 | Alternating sliding-window/global GQA |
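The "Active %" column is simply active parameters divided by total parameters. A minimal sketch that reproduces a few rows of the ranking (parameter counts in billions, taken directly from the table above):

```python
# Active-parameter share = active params / total params (per token).
# Counts are in billions and come straight from the table above.
models = {
    "Kimi K2":       (32.0, 1000.0),
    "GPT-OSS 120B":  (5.1, 117.0),
    "DeepSeek V3":   (37.0, 671.0),
    "Qwen3 30B-A3B": (3.0, 30.0),
    "GPT-OSS 20B":   (3.6, 21.0),
}

# Sort ascending by share, matching the table's ordering.
for name, (active, total) in sorted(models.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name:<14} {active / total:5.1%}  ({active:g}B active / {total:g}B total)")
```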
Caveat: active-parameter share is only one lens. It does not capture KV cache size, attention pattern, context length, routing overhead, hardware efficiency, or training quality. Still, it is a helpful quick check when comparing sparse models.
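To make the KV cache point concrete: two models can have identical active-parameter shares yet differ by an order of magnitude in cache footprint. A back-of-the-envelope sketch using the standard per-token KV size formula, with hypothetical layer/head counts that are not the specs of any model above:

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """One key and one value vector per KV head per layer; fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical configs, for illustration only (not real model specs):
gqa = kv_bytes_per_token(n_layers=60, n_kv_heads=8, head_dim=128)   # grouped-query
mha = kv_bytes_per_token(n_layers=60, n_kv_heads=64, head_dim=128)  # full multi-head

ctx = 128 * 1024  # 128K-token context
for label, b in [("8 KV heads (GQA)", gqa), ("64 KV heads (MHA)", mha)]:
    print(f"{label:<18} {b / 1024:5.0f} KiB/token -> {b * ctx / 2**30:5.1f} GiB at 128K")
```

MLA, sliding-window, and linear-attention hybrids all attack this per-token term, which is why the Attention column belongs next to Active % when comparing the models above.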