sebastianraschka.com/llm-architecture-gallery/

Attention Mechanism Distribution

Counts across the 70 visible LLM Architecture Gallery cards. Categories are non-exclusive: a model that combines mechanisms, such as MLA plus DeepSeek Sparse Attention, is counted in every matching row.

Each entry lists the mechanism or pattern, its count and share of the 70 cards, and the matching models.

GQA-family attention
Grouped-query, multi-query, or CCA/GQA-style attention.
Count: 43 of 70 (~61%)
Models: Llama 3 8B, Llama 3.2 1B, Gemma 3 27B, Mistral Small 3.1 24B, Llama 4 Maverick, Qwen3 0.6B, Qwen3 235B-A22B, Qwen3 30B-A3B, Qwen3 32B, Qwen3 4B, Qwen3 8B, SmolLM3 3B, GLM-4.5 355B, GPT-OSS 120B, GPT-OSS 20B, Gemma 3 270M, Grok 2.5 270B, MiniMax M2 230B, OLMo 3 32B, Nemotron 3 Nano 30B-A3B, GLM-4.7 355B, Arcee AI Trinity Large 400B, Nemotron 3 Super 120B-A12B, Gemma 4 31B, Gemma 4 26B-A4B, Phi-4, GLM-4.5-Air, Qwen3 Coder Flash 30B-A3B, Step 3.5 Flash 196B, Nanbeige 4.1 3B, MiniMax M2.5 230B, Tiny Aya 3.35B, Sarvam 30B, Llama 3.2 3B, INTELLECT-3, Nemotron 3 Nano 4B, MiniMax M2.7 230B, Gemma 4 E2B, Gemma 4 E4B, Tencent Hy3-preview 295B-A21B, Laguna XS.2, Granite 4.1 30B, ZAYA1-8B

MLA-family attention
Multi-head latent attention and closely related MLA variants.
Count: 17 of 70 (~24%)
Models: DeepSeek V3, DeepSeek R1, Kimi K2, Kimi Linear 48B-A3B, DeepSeek V3.2, Mistral Large 3, GLM-5 744B, Kimi K2.5, Ling 2.5 1T, Sarvam 105B, LongCat-Flash-Lite 68.5B-A3B, Mistral Small 4, GLM-5.1, Kimi K2.6, Ling 2.6 1T, DeepSeek V4-Flash, DeepSeek V4-Pro

Sliding-window/global patterns
Architectures that mix local, chunked, or sliding-window layers with global/full attention layers.
Count: 17 of 70 (~24%)
Models: Gemma 3 27B, Llama 4 Maverick, GPT-OSS 120B, GPT-OSS 20B, Gemma 3 270M, OLMo 3 32B, OLMo 3 7B, Xiaomi MiMo-V2-Flash 309B, Arcee AI Trinity Large 400B, Gemma 4 31B, Gemma 4 26B-A4B, Step 3.5 Flash 196B, Tiny Aya 3.35B, Gemma 4 E2B, Gemma 4 E4B, Xiaomi MiMo-V2.5 310B, Laguna XS.2

DeltaNet / Lightning / Kimi Delta
Hybrid recurrent or linear-attention-style layers paired with attention layers.
Count: 7 of 70 (10%)
Models: Qwen3 Next 80B-A3B, Kimi Linear 48B-A3B, Ling 2.5 1T, Qwen3.5 397B, Qwen3.6 35B-A3B, Qwen3.6 27B, Ling 2.6 1T

Mamba / mLSTM recurrent layers
Mamba-2, mLSTM, or recurrent state-space-style blocks.
Count: 4 of 70 (~6%)
Models: Nemotron 3 Nano 30B-A3B, Nemotron 3 Super 120B-A12B, xLSTM 7B, Nemotron 3 Nano 4B

DeepSeek Sparse Attention
Explicit DeepSeek Sparse Attention variants.
Count: 3 of 70 (~4%)
Models: DeepSeek V3.2, GLM-5 744B, GLM-5.1

MHA-family attention
Classic multi-head attention without GQA/MLA as the main mechanism.
Count: 3 of 70 (~4%)
Models: GPT-2 XL 1.5B, OLMo 2 7B, OLMo 3 7B

CSA/HCA
Compressed sparse or hyper-compressed attention variants in DeepSeek V4-style models.
Count: 2 of 70 (~3%)
Models: DeepSeek V4-Flash, DeepSeek V4-Pro

CCA
Compressed context attention.
Count: 1 of 70 (~1%)
Models: ZAYA1-8B

No self-attention
Recurrent architecture entries with no self-attention layers.
Count: 1 of 70 (~1%)
Models: xLSTM 7B

Note: this table counts visible gallery cards, not unique model families. It uses the gallery metadata fields for attention, layer mix, and decoder type.
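For readers who want to reproduce this kind of tally from card metadata, the sketch below shows one way to count non-exclusive mechanism tags. It is a minimal sketch under assumptions: the cards list, the model and mechanisms field names, and the sample values are all hypothetical and do not reflect the gallery's actual metadata schema.

```python
from collections import Counter

# Hypothetical card records; field names and values are illustrative only,
# not the gallery's real metadata fields (attention, layer mix, decoder type).
cards = [
    {"model": "DeepSeek V3.2", "mechanisms": {"MLA-family", "DeepSeek Sparse Attention"}},
    {"model": "Gemma 3 27B", "mechanisms": {"GQA-family", "Sliding-window/global"}},
    {"model": "xLSTM 7B", "mechanisms": {"Mamba / mLSTM recurrent", "No self-attention"}},
]

def mechanism_distribution(cards):
    """Tally non-exclusive mechanism tags across cards.

    A card listing several mechanisms contributes to every matching row,
    so the row counts can sum to more than the number of cards.
    """
    counts = Counter()
    members = {}
    for card in cards:
        for mechanism in card["mechanisms"]:
            counts[mechanism] += 1
            members.setdefault(mechanism, []).append(card["model"])
    return counts, members

counts, members = mechanism_distribution(cards)
total = len(cards)
for mechanism, n in counts.most_common():
    # Share is computed against the number of cards, not the sum of row counts.
    print(f"{mechanism}: {n} of {total} ({n / total:.0%}) -> {', '.join(members[mechanism])}")
```

Run on the three sample cards, each mechanism prints a count of 1 out of 3 cards; on the full gallery metadata the same logic would reproduce the counts and shares listed above.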