DeepSeek Sparse Attention
DeepSeek Sparse Attention is a sparse attention mechanism with a learned mask: it tries to keep the efficiency benefits of local attention without hard-coding a fixed window.
Instead of deciding in advance that every token can only look back a fixed number of positions, DeepSeek Sparse Attention uses a lightweight indexer and token selector to pick which earlier tokens are worth attending to. That makes it more flexible than sliding-window attention, but also more specialized and more implementation-heavy.
Why It Exists
The core problem is familiar: the cost of full attention grows quadratically with sequence length. Sliding-window attention fixes that by enforcing a local receptive field, but it is rigid because the same geometric rule applies everywhere. DeepSeek Sparse Attention takes a more adaptive route and tries to learn which previous tokens matter for each new query token.
That makes it a more ambitious form of sparse attention. It is not “attend locally”; it is “attend selectively.”
How The Indexer And Selector Work
In the DeepSeek V3.2 write-up, the mechanism is described as two parts: a lightning indexer and a token selector. The indexer scores the previous tokens for each query token, and the selector then keeps only the top-scoring subset to build that query's sparse mask. In other words, the sparsity pattern is learned rather than hand-designed.
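To make the score-then-select idea concrete, here is a minimal sketch of top-k sparse selection for a single query token. It is illustrative only, not DeepSeek's implementation: the shapes, the projection matrices, and the function names are assumptions, and a real lightning indexer uses its own small learned parametrization and much more efficient kernels than this dense NumPy version.

```python
# Sketch of learned top-k sparse selection (assumed shapes and parametrization).
import numpy as np

def indexer_scores(query, past_keys, w_q, w_k):
    # Lightweight scoring: project the query and past tokens into a small
    # space with (hypothetical) learned matrices, then take dot products.
    q = query @ w_q                    # (d_small,)
    k = past_keys @ w_k                # (t, d_small)
    return k @ q                       # (t,) one score per previous token

def select_top_k(scores, k):
    # Token selector: keep only the k highest-scoring previous tokens.
    k = min(k, scores.shape[0])
    top = np.argpartition(scores, -k)[-k:]
    mask = np.zeros(scores.shape[0], dtype=bool)
    mask[top] = True
    return mask                        # sparse attention mask for this query

def sparse_attention(query, past_keys, past_values, mask):
    # Ordinary scaled dot-product attention, restricted to selected tokens.
    d = query.shape[-1]
    logits = past_keys @ query / np.sqrt(d)
    logits = np.where(mask, logits, -np.inf)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ past_values

# Toy usage: one query attending over 1024 cached tokens, keeping only 64.
rng = np.random.default_rng(0)
t, d, d_small, k = 1024, 128, 32, 64
query = rng.normal(size=(d,))
past_k, past_v = rng.normal(size=(t, d)), rng.normal(size=(t, d))
w_q, w_k = rng.normal(size=(d, d_small)), rng.normal(size=(d, d_small))
mask = select_top_k(indexer_scores(query, past_k, w_q, w_k), k)
out = sparse_attention(query, past_k, past_v, mask)
print(mask.sum(), out.shape)           # 64 selected tokens, (128,) output
```

In the real model the indexer's parameters are trained so that high scores track which tokens the full attention would have weighted heavily; the sketch only shows where those scores slot into the pipeline.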
This is also why the mechanism is tightly coupled to the rest of the DeepSeek serving stack. It is not a generic sparse mask bolted onto any decoder without extra engineering.
How It Differs From SWA
Sliding-window attention chooses a fixed recent window. DeepSeek Sparse Attention chooses a variable set of previous tokens that can come from anywhere in the past sequence if they score highly enough. That makes it more flexible than SWA, but also less transparent and more involved to implement.
The easiest shorthand is: SWA is geometric sparsity, DeepSeek Sparse Attention is learned sparsity.
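The shorthand is easy to see in code. The snippet below contrasts the two mask constructions under the same toy assumptions as before (helper names and shapes are illustrative, not any model's real code): the sliding-window mask depends only on position, while the learned mask depends on the indexer's scores for this particular query.

```python
# Geometric sparsity vs. learned sparsity (illustrative helpers, assumed shapes).
import numpy as np

def sliding_window_mask(num_past, window):
    # Geometric sparsity: always the most recent `window` positions.
    mask = np.zeros(num_past, dtype=bool)
    mask[max(0, num_past - window):] = True
    return mask

def learned_sparse_mask(scores, k):
    # Learned sparsity: whichever positions the indexer scored highest,
    # anywhere in the past sequence.
    keep = np.argpartition(scores, -k)[-k:]
    mask = np.zeros(scores.shape[0], dtype=bool)
    mask[keep] = True
    return mask

scores = np.random.default_rng(1).normal(size=1024)
print(sliding_window_mask(1024, 64).nonzero()[0][:5])              # always 960..1023
print(np.sort(learned_sparse_mask(scores, 64).nonzero()[0])[:5])   # can be anywhere
```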
How To Read It In The Gallery
On the gallery page, this concept appears explicitly in the cards for DeepSeek V3.2 and GLM-5. If a fact sheet says “MLA with DeepSeek Sparse Attention,” that means the model is compressing its KV cache with MLA (Multi-head Latent Attention) and then further reducing long-context attention cost through the learned sparse mask.