Percent Active Parameters per Token

| # | Model | Active % | Active params | Total params | Type | Date | Attention |
|---|-------|----------|---------------|--------------|------|------|-----------|
| 1 | DeepSeek V4-Pro | 3.1% | 49B | 1.6T | MoE | 2026-04-24 | CSA/HCA |
| 2 | Kimi K2 | 3.2% | 32B | 1T | MoE | 2025-07-10 | MLA |
| 3 | Kimi K2.5 | 3.2% | 32B | 1T | MoE | 2026-01-27 | MLA |
| 4 | Kimi K2.6 | 3.2% | 32B | 1T | MoE | 2026-04-20 | MLA |
| 5 | Arcee AI Trinity Large 400B | 3.3% | 13B | 400B | MoE | 2026-01-27 | 3:1 sliding-window/global gated GQA |
| 6 | Qwen3 Next 80B-A3B | 3.8% | 3B | 80B | Hybrid | 2025-09-09 | 3:1 Gated DeltaNet and Gated Attention |
| 7 | Llama 4 Maverick | 4.3% | 17B | 400B | MoE | 2025-04-05 | GQA |
| 8 | MiniMax M2 230B | 4.3% | 10B | 230B | MoE | 2025-10-23 | GQA |
| 9 | MiniMax M2.5 230B | 4.3% | 10B | 230B | MoE | 2026-02-12 | GQA |
| 10 | Qwen3.5 397B | 4.3% | 17B | 397B | Hybrid | 2026-02-16 | 3:1 Gated DeltaNet and Gated Attention |
| 11 | MiniMax M2.7 230B | 4.3% | 10B | 230B | MoE | 2026-03-18 | GQA |
| 12 | GPT-OSS 120B | 4.4% | 5.1B | 117B | MoE | 2025-08-04 | Alternating sliding-window/global GQA |
| 13 | LongCat-Flash-Lite 68.5B-A3B | 4.4% | 3B | 68.5B | MoE | 2026-01-28 | MLA |
| 14 | DeepSeek V4-Flash | 4.6% | 13B | 284B | MoE | 2026-04-24 | CSA/HCA |
| 15 | Xiaomi MiMo-V2.5 310B | 4.8% | 15B | 310B | MoE | 2026-04-22 | 5:1 sliding-window/global attention |
| 16 | Xiaomi MiMo-V2-Flash 309B | 4.9% | 15B | 309B | MoE | 2025-12-16 | 5:1 sliding-window/global attention |
| 17 | GLM-5 744B | 5.4% | 40B | 744B | MoE | 2026-02-11 | MLA with DeepSeek Sparse Attention |
| 18 | GLM-5.1 | 5.4% | 40B | 744B | MoE | 2026-04-07 | MLA with DeepSeek Sparse Attention |
| 19 | DeepSeek V3 | 5.5% | 37B | 671B | MoE | 2024-12-26 | MLA |
| 20 | DeepSeek R1 | 5.5% | 37B | 671B | MoE | 2025-01-20 | MLA |
| 21 | DeepSeek V3.2 | 5.5% | 37B | 671B | MoE | 2025-12-01 | MLA with DeepSeek Sparse Attention |
| 22 | Step 3.5 Flash 196B | 5.6% | 11B | 196B | MoE | 2026-02-01 | 3:1 sliding-window GQA |
| 23 | Mistral Small 4 | 5.6% | 6.63B | 119B | MoE | 2026-03-16 | MLA |
| 24 | Mistral Large 3 | 6.1% | 41B | 673B | MoE | 2025-12-02 | MLA |
| 25 | Kimi Linear 48B-A3B | 6.3% | 3B | 48B | Hybrid | 2025-10-30 | 3:1 Kimi Delta Attention and MLA |
| 26 | Ling 2.5 1T | 6.3% | 63B | 1T | Hybrid | 2026-02-15 | Lightning Attention plus MLA |
| 27 | Ling 2.6 1T | 6.3% | 63B | 1T | Hybrid | 2026-04-23 | Lightning Attention plus MLA |
| 28 | Tencent Hy3-preview 295B-A21B | 7.1% | 21B | 295B | MoE | 2026-04-23 | GQA |
| 29 | Sarvam 30B | 8% | 2.4B | 30B | MoE | 2026-03-03 | GQA |
| 30 | Qwen3.6 35B-A3B | 8.6% | 3B | 35B | Hybrid | 2026-04-15 | 3:1 Gated DeltaNet and Gated Attention |
| 31 | GLM-4.5 355B | 9% | 32B | 355B | MoE | 2025-07-28 | GQA |
| 32 | GLM-4.7 355B | 9% | 32B | 355B | MoE | 2025-12-22 | GQA |
| 33 | ZAYA1-8B | 9% | 760M | 8.4B | MoE | 2026-05-06 | CCA with 4:1 GQA |
| 34 | Laguna XS.2 | 9.1% | 3B | 33B | MoE | 2026-04-28 | 3:1 sliding-window/global gated GQA |
| 35 | Qwen3 235B-A22B | 9.4% | 22B | 235B | MoE | 2025-04-28 | GQA |
| 36 | Sarvam 105B | 9.8% | 10.3B | 105B | MoE | 2026-03-03 | MLA |
| 37 | Qwen3 30B-A3B | 10% | 3B | 30B | MoE | 2025-04-28 | GQA |
| 38 | Nemotron 3 Nano 30B-A3B | 10% | 3B | 30B | Hybrid MoE | 2025-12-04 | Mamba-2 + GQA |
| 39 | Nemotron 3 Super 120B-A12B | 10% | 12B | 120B | Hybrid MoE | 2026-03-11 | Mamba-2 + GQA |
| 40 | Qwen3 Coder Flash 30B-A3B | 11% | 3.3B | 30B | MoE | 2025-07-31 | GQA |
| 41 | GLM-4.5-Air | 11.3% | 12B | 106B | MoE | 2025-07-28 | GQA |
| 42 | INTELLECT-3 | 11.3% | 12B | 106B | MoE | 2025-11-26 | GQA |
| 43 | Gemma 4 26B-A4B | 15.1% | 3.8B | 25.2B | MoE | 2026-04-02 | 5:1 sliding-window/global GQA |
| 44 | GPT-OSS 20B | 17.1% | 3.6B | 21B | MoE | 2025-08-04 | Alternating sliding-window/global GQA |

Caveat: active parameter share is only one lens. It does not capture KV cache size, attention pattern, context length, routing overhead, hardware efficiency, or training quality. But it is a helpful quick check when comparing sparse models.
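
For reference, the "Active %" column is simply active parameters divided by total parameters. A minimal Python sketch of that calculation, using a few values copied from the table above (the `parse_count` helper is mine, introduced just for illustration):

```python
def parse_count(s: str) -> float:
    """Convert strings like '760M', '37B', or '1.6T' to a raw parameter count."""
    scale = {"M": 1e6, "B": 1e9, "T": 1e12}[s[-1]]
    return float(s[:-1]) * scale

# (active params, total params) taken from the table above
models = {
    "DeepSeek V3": ("37B", "671B"),
    "GPT-OSS 20B": ("3.6B", "21B"),
    "Kimi K2": ("32B", "1T"),
}

for name, (active, total) in models.items():
    share = parse_count(active) / parse_count(total) * 100
    print(f"{name}: {share:.1f}% active")
    # DeepSeek V3: 5.5% active, GPT-OSS 20B: 17.1% active, Kimi K2: 3.2% active
```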