sebastianraschka.com/llm-architecture-gallery/
Attention Mechanism Distribution
Counts across 70 visible LLM Architecture Gallery cards. Categories are non-exclusive: a single model can be counted in multiple rows when it combines mechanisms, such as MLA plus DeepSeek Sparse Attention.
| Mechanism or pattern | Count | Share | Matching models |
|---|---|---|---|
| **GQA-family attention** (grouped-query, multi-query, or CCA/GQA-style attention) | 43 | 61% | Llama 3 8B, Llama 3.2 1B, Gemma 3 27B, Mistral Small 3.1 24B, Llama 4 Maverick, Qwen3 0.6B, Qwen3 235B-A22B, Qwen3 30B-A3B, Qwen3 32B, Qwen3 4B, Qwen3 8B, SmolLM3 3B, GLM-4.5 355B, GPT-OSS 120B, GPT-OSS 20B, Gemma 3 270M, Grok 2.5 270B, MiniMax M2 230B, OLMo 3 32B, Nemotron 3 Nano 30B-A3B, GLM-4.7 355B, Arcee AI Trinity Large 400B, Nemotron 3 Super 120B-A12B, Gemma 4 31B, Gemma 4 26B-A4B, Phi-4, GLM-4.5-Air, Qwen3 Coder Flash 30B-A3B, Step 3.5 Flash 196B, Nanbeige 4.1 3B, MiniMax M2.5 230B, Tiny Aya 3.35B, Sarvam 30B, Llama 3.2 3B, INTELLECT-3, Nemotron 3 Nano 4B, MiniMax M2.7 230B, Gemma 4 E2B, Gemma 4 E4B, Tencent Hy3-preview 295B-A21B, Laguna XS.2, Granite 4.1 30B, ZAYA1-8B |
| **MLA-family attention** (multi-head latent attention and closely related MLA variants) | 17 | 24% | DeepSeek V3, DeepSeek R1, Kimi K2, Kimi Linear 48B-A3B, DeepSeek V3.2, Mistral Large 3, GLM-5 744B, Kimi K2.5, Ling 2.5 1T, Sarvam 105B, LongCat-Flash-Lite 68.5B-A3B, Mistral Small 4, GLM-5.1, Kimi K2.6, Ling 2.6 1T, DeepSeek V4-Flash, DeepSeek V4-Pro |
| **Sliding-window/global patterns** (local/chunked/sliding-window layers mixed with global/full attention layers) | 17 | 24% | Gemma 3 27B, Llama 4 Maverick, GPT-OSS 120B, GPT-OSS 20B, Gemma 3 270M, OLMo 3 32B, OLMo 3 7B, Xiaomi MiMo-V2-Flash 309B, Arcee AI Trinity Large 400B, Gemma 4 31B, Gemma 4 26B-A4B, Step 3.5 Flash 196B, Tiny Aya 3.35B, Gemma 4 E2B, Gemma 4 E4B, Xiaomi MiMo-V2.5 310B, Laguna XS.2 |
| **DeltaNet / Lightning / Kimi Delta** (hybrid recurrent or linear-attention-style layers paired with attention layers) | 7 | 10% | Qwen3 Next 80B-A3B, Kimi Linear 48B-A3B, Ling 2.5 1T, Qwen3.5 397B, Qwen3.6 35B-A3B, Qwen3.6 27B, Ling 2.6 1T |
| **Mamba / mLSTM recurrent layers** (Mamba-2, mLSTM, or recurrent state-space-style blocks) | 4 | 6% | Nemotron 3 Nano 30B-A3B, Nemotron 3 Super 120B-A12B, xLSTM 7B, Nemotron 3 Nano 4B |
| **DeepSeek Sparse Attention** (explicit DeepSeek Sparse Attention variants) | 3 | 4% | DeepSeek V3.2, GLM-5 744B, GLM-5.1 |
| **MHA-family attention** (classic multi-head attention without GQA/MLA as the main mechanism) | 3 | 4% | GPT-2 XL 1.5B, OLMo 2 7B, OLMo 3 7B |
| **CSA/HCA** (compressed sparse or hyper-compressed attention variants in DeepSeek V4-style models) | 2 | 3% | DeepSeek V4-Flash, DeepSeek V4-Pro |
| **CCA** (compressed context attention) | 1 | 1% | ZAYA1-8B |
| **No self-attention** (recurrent architectures with no self-attention layers) | 1 | 1% | xLSTM 7B |
Note: this table counts visible gallery cards, not unique model families. It uses the gallery metadata fields for attention, layer mix, and decoder type.
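Because the categories are non-exclusive, the shares intentionally sum to more than 100%. A minimal Python sketch of how the Share column is derived (the dictionary below simply restates the counts from the table):

```python
# Recompute each mechanism's share of the 70 visible gallery cards.
# Counts are copied from the table above; membership is non-exclusive,
# so the rounded shares add up to well over 100%.
TOTAL_CARDS = 70

counts = {
    "GQA-family attention": 43,
    "MLA-family attention": 17,
    "Sliding-window/global patterns": 17,
    "DeltaNet / Lightning / Kimi Delta": 7,
    "Mamba / mLSTM recurrent layers": 4,
    "DeepSeek Sparse Attention": 3,
    "MHA-family attention": 3,
    "CSA/HCA": 2,
    "CCA": 1,
    "No self-attention": 1,
}

# Share = count / total cards, rounded to the nearest whole percent.
shares = {name: round(100 * n / TOTAL_CARDS) for name, n in counts.items()}

print(shares["GQA-family attention"])  # 61
print(sum(shares.values()))            # 138: overlap pushes the total past 100%
```

Note that a model counted in several rows (e.g. DeepSeek V3.2 under both MLA-family attention and DeepSeek Sparse Attention) contributes to each of those shares independently.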