Gated Attention

Gated attention is best understood as a modified full-attention block rather than as a separate attention family.

In the architectures covered here, it usually appears inside hybrid stacks that still keep an occasional full-attention layer for exact content retrieval, but add a few stability-oriented changes on top of an otherwise familiar scaled dot-product attention block.

Architecture gallery Qwen3.5 from-scratch Nb

Arcee AI Trinity Large architecture and gated attention code snippet — Trinity Large is a useful comparison because gated attention is not only a Qwen idea. Here the gate appears after the scaled dot-product attention output and before the output projection in a different long-context architecture (Original source: *A Dream of Spring for Open-Weight LLMs*).

Where It Appears

Qwen3-Next and Qwen3.5 architectures show that these recent hybrids do not replace attention everywhere. Instead, they replace most attention layers with a cheaper alternative and keep a smaller number of full-attention layers in the stack.

Those remaining full-attention layers are where gated attention typically appears. Qwen3-Next and Qwen3.5 use it together with Gated DeltaNet in a 3:1 pattern.

But hybrid architectures aside, Trinity uses a related gating idea in a more conventional attention stack, as shown in the previous figure above.

What Changes Relative To Standard Attention

The gated attention block in Qwen-style hybrids is essentially standard scaled-dot-product attention with a few changes on top. In the original Gated Attention paper, those changes are presented as a way to make the retained full-attention layers behave more predictably inside a hybrid stack.

The block still looks like standard (full) attention, but it adds:

an output gate that scales the attention result before it is added back to the residual,
a zero-centered QK-Norm variant instead of standard RMSNorm for q and k,
partial RoPE.

These are not changes on the scale of MLA or linear attention. They are better understood as stability and control changes applied to an otherwise familiar attention block.

Qwen3 Next architecture showing gated attention as part of a hybrid stack — In Qwen3-Next and Qwen3.5, gated attention appears as the full-attention layer that periodically breaks up runs of Gated DeltaNet blocks.

Gated Attention in Trinity

As mentioned above, Trinity does not use the full Qwen-style Gated DeltaNet hybrid, but it borrows a similar gating idea for attention. Trinity applies elementwise gating to the scaled dot-product before the output projection, which the Trinity technical report associates with reduced attention sinks, better long-sequence generalization, and improved training stability.

Sources

The Big LLM Architecture Comparison A Dream of Spring for Open-Weight LLMs Gated Attention paper Trinity technical report Qwen3.5 implementation notes

Back to architecture gallery