Grouped-Query Attention (GQA)

Grouped-query attention is an attention variant derived from standard multi-head attention. It was introduced in the 2023 paper GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints by Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Instead of giving every query head its own keys and values, it lets several query heads share the same key-value projections, which makes KV caching much cheaper without changing the overall decoder recipe very much.


Comparison between multi-head attention and grouped-query attention — GQA keeps the same overall attention pattern as MHA, but collapses the number of key-value heads by sharing them across multiple query heads (Original source: *The Big LLM Architecture Comparison*).

What it optimizes

KV-cache size, memory bandwidth, and long-context inference cost

Practical benefit

It delivers most of the practical benefits of a leaner attention stack without changing the decoder recipe much

Example architectures

Dense: Llama 3 8B, Qwen3 4B, Gemma 3 27B, Mistral Small 3.1 24B, SmolLM3 3B, and Tiny Aya 3.35B.
Sparse: Llama 4 Maverick, Qwen3 235B-A22B, Step 3.5 Flash 196B, and Sarvam 30B.

Why GQA Became Popular

In the first architecture comparison article, I framed GQA as the new standard replacement for classic multi-head attention (MHA). The reason is that standard MHA gives every head its own keys and values, which is better from a modeling perspective but expensive once we have to keep all of that state in the KV cache during inference.

In GQA, we keep a larger set of query heads, but we reduce the number of key-value heads and let multiple query heads share each of them. That lowers both parameter count and KV-cache traffic without requiring the drastic implementation changes that multi-head latent attention (MLA) does. In practice, that made it a very attractive operating point for labs that wanted something cheaper than MHA but simpler to implement than newer compression-heavy alternatives like MLA.
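To make the sharing concrete, here is a minimal pure-Python sketch of GQA for a single query position. It is an illustration under stated assumptions rather than a reference implementation: the function names (`kv_head_for`, `gqa_single_token`, `kv_projection_params`) are hypothetical, the contiguous grouping of query heads is an assumed convention, and a real implementation would use batched tensor operations.

```python
import math


def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Map a query head to the KV head it shares (assumed: contiguous groups)."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be divisible by n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gqa_single_token(queries, keys, values):
    """Attention output for one query position.

    queries:      [n_query_heads][head_dim]
    keys, values: [n_kv_heads][seq_len][head_dim]  (the cached state)
    returns:      [n_query_heads][head_dim]
    """
    n_query_heads, n_kv_heads = len(queries), len(keys)
    head_dim = len(queries[0])
    outputs = []
    for h, q in enumerate(queries):
        g = kv_head_for(h, n_query_heads, n_kv_heads)  # shared KV head
        # Scaled dot-product scores against every cached key of that KV head.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(head_dim)
            for k in keys[g]
        ]
        weights = softmax(scores)
        # Weighted sum of the cached values of the same KV head.
        out = [
            sum(w * v[d] for w, v in zip(weights, values[g]))
            for d in range(head_dim)
        ]
        outputs.append(out)
    return outputs


def kv_projection_params(d_model, n_kv_heads, head_dim):
    """Weights in the K and V projections combined (assumed: no bias terms)."""
    return 2 * d_model * n_kv_heads * head_dim
```

With 8 query heads and 2 key-value heads, `kv_head_for` sends heads 0 to 3 to KV head 0 and heads 4 to 7 to KV head 1. Only the key and value side shrinks; the query heads, and the attention computation itself, stay as in MHA.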

Memory Savings

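GQA yields large savings in KV storage, and therefore lower memory requirements, because the cached state per token scales with the number of key-value heads per layer: the fewer we keep, the less we need to cache. Since the cache also grows with every token in the context, the absolute savings grow with sequence length, which is why GQA becomes more useful as sequence length grows.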
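A back-of-the-envelope calculation shows the effect. The sketch below assumes the usual accounting of one key and one value vector per KV head, per layer, per token. The configuration (32 layers, head dimension 128, 32 query heads, 8 KV heads, 2 bytes per value) is an illustrative assumption and not taken from any model card, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV-cache size in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes a 16-bit format such as bf16 or fp16.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 1024 ** 3
config = dict(n_layers=32, head_dim=128)  # illustrative values (assumption)

for seq_len in (4096, 32768, 131072):
    mha = kv_cache_bytes(n_kv_heads=32, seq_len=seq_len, **config)
    gqa = kv_cache_bytes(n_kv_heads=8, seq_len=seq_len, **config)
    print(f"{seq_len:>7} tokens: MHA {mha / GIB:6.2f} GiB, "
          f"GQA {gqa / GIB:6.2f} GiB, saved {(mha - gqa) / GIB:6.2f} GiB")
```

Under these assumptions the ratio between the two stays fixed at 4x (32 versus 8 KV heads), but the absolute saving scales linearly with context length: 1.5 GiB at 4,096 tokens and 48 GiB at 131,072 tokens.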

GQA is also a spectrum. If we reduce all the way down to one shared K/V group, we are effectively in multi-query attention territory, which is even cheaper but can hurt modeling quality more noticeably. The sweet spot is usually somewhere between multi-query attention (1 shared group) and MHA (where the number of K/V groups equals the number of query heads), where the cache savings are large but the modeling degradation relative to MHA stays modest.
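The spectrum can be enumerated directly. The following sketch lists, for a given number of query heads, every KV-head count that divides it evenly, along with how many query heads share each KV head and the cache size relative to MHA. The helper name `gqa_spectrum` and the divisibility requirement are assumptions made for illustration; the sketch says nothing about modeling quality, which has to be measured empirically.

```python
def gqa_spectrum(n_query_heads):
    """Enumerate valid KV-head counts from MQA (1) up to MHA (n_query_heads)."""
    rows = []
    for n_kv_heads in range(1, n_query_heads + 1):
        if n_query_heads % n_kv_heads != 0:
            continue  # assumed: query heads split evenly across KV heads
        if n_kv_heads == n_query_heads:
            label = "MHA"
        elif n_kv_heads == 1:
            label = "MQA"
        else:
            label = "GQA"
        rows.append({
            "label": label,
            "n_kv_heads": n_kv_heads,
            "queries_per_kv_head": n_query_heads // n_kv_heads,
            "cache_vs_mha": n_kv_heads / n_query_heads,
        })
    return rows


for row in gqa_spectrum(8):
    print(row)
```

For 8 query heads, this gives four operating points: 1 KV head (MQA, one eighth of the MHA cache), 2 and 4 KV heads (GQA, a quarter and a half), and 8 KV heads (MHA, the full cache).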

Memory savings of grouped-query attention versus multi-head attention — Once the context window grows, KV-cache savings become more pronounced (Original source: *LLMs-from-scratch* GQA materials).

Why GQA Still Matters In 2026

More advanced variants such as MLA are becoming popular because they can offer better modeling performance at the same KV efficiency levels (e.g., as discussed in the ablation studies of the DeepSeek-V2 paper), but they also involve a more complicated implementation. GQA remains appealing precisely because it is robust, easier to implement, and also easier to train (since, in my experience, it requires less hyperparameter tuning).

That is why some of the newer releases still stay deliberately classic here. E.g., in my Spring Architectures article, I mentioned MiniMax M2.5 and Nanbeige 4.1 as models that remained very classic, using only grouped-query attention without piling on other efficiency tricks. Sarvam is a particularly useful comparison point as well: the 30B model keeps classic GQA, while the 105B version switches to MLA.

Relative efficiency comparison between grouped-query attention, multi-head latent attention, and multi-head attention — GQA and MLA are both best understood as responses to the same bottleneck. GQA is the simpler fix; MLA usually pushes efficiency further, but at the cost of extra complexity (Original source: *A Dream of Spring for Open-Weight LLMs*).

Sources

Ainslie et al. (2023), GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
The Big LLM Architecture Comparison
A Dream of Spring for Open-Weight LLMs
LLMs-from-scratch GQA materials
