DeepSeek Sparse Attention
DeepSeek Sparse Attention is one of the architectural changes that appeared in the DeepSeek V3.2 line and later showed up again in GLM-5.
In the gallery, it is best read as part of a broader efficiency-oriented stack. DeepSeek V3.2 combines it with Multi-head Latent Attention (MLA), and GLM-5 adopts the same pair for the same general reason, namely, reducing inference cost when context lengths get large.
Summary: Use a lightning indexer and a token selector to keep only a learned subset of past tokens
Practical benefit: Reduce long-context attention cost without hard-coding a fixed local window
Example architectures: DeepSeek V3.2 and GLM-5
Changes Relative To Sliding-Window Attention
Let’s briefly take a step back and start with sliding-window attention. In sliding-window attention, the current token does not attend to the full prefix but only to a fixed local window of recent tokens. DeepSeek Sparse Attention shares the same broad idea: each token attends to only a subset of previous tokens.
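To make the baseline concrete, here is a minimal PyTorch sketch of the fixed-window mask that sliding-window attention uses. The function name and window size are illustrative, not taken from any particular implementation.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where query position i may attend to key position j:
    # causal (j <= i) and within the last `window` positions (i - j < window).
    pos = torch.arange(seq_len)
    dist = pos.unsqueeze(1) - pos.unsqueeze(0)  # dist[i, j] = i - j
    return (dist >= 0) & (dist < window)

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.int())  # each row is a contiguous block of at most 3 ones ending at the diagonal
```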
However, the selected tokens are not determined by a fixed-width local window. Instead, DeepSeek Sparse Attention uses a learned sparse pattern. In short, it uses an indexer-plus-selector setup, where a lightning indexer computes relevance scores, and a token selector keeps only a smaller set of high-scoring past positions.
The way the subset of tokens is selected is the main difference from sliding-window attention. Sliding-window attention hard-codes locality. DeepSeek Sparse Attention still limits attention to a subset, but it lets the model decide which prior tokens are worth revisiting.
DeepSeek Sparse Attention still avoids full-prefix attention, but unlike sliding-window attention, the kept positions do not have to form one contiguous local block. They can come from different parts of the prefix if the indexer and selector score them highly enough.
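As a tiny illustration of that difference, assuming stand-in random scores for a single query position, a fixed window always keeps the most recent block of positions, while a top-k pick over relevance scores can keep positions from anywhere in the prefix:

```python
import torch

torch.manual_seed(0)
seq_len, window, k = 8, 3, 3
scores = torch.randn(seq_len)      # stand-in relevance scores for the query at position 7
positions = torch.arange(seq_len)  # the full prefix, positions 0..7

window_keep = positions[positions >= seq_len - window]  # fixed window: always the last 3 positions
topk_keep = scores.topk(k).indices.sort().values        # score-based pick: can land anywhere in the prefix

print(window_keep.tolist(), topk_keep.tolist())
```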
That is why DeepSeek Sparse Attention is not just a small adjustment to a standard decoder block, as sliding-window attention is, but a more specific decision about how the model should revisit prior context.
DeepSeek Sparse Attention and MLA
DeepSeek V3.2 uses both Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention. MLA reduces KV-cache cost by compressing what gets stored. DeepSeek Sparse Attention reduces how much of the prior context the model has to revisit. Put differently, one optimizes the cache representation, the other optimizes the attention pattern on top of it.
The sparse pattern is not random. The first stage is a lightning indexer that scores previous tokens for each new query token. It uses MLA’s compressed token representations and computes a learned similarity score over the prior context, so the model can rank which earlier positions are worth revisiting.
The second stage is a token selector. It keeps only a smaller high-scoring subset, for example, a top-k set of past positions, and turns that subset into the sparse attention mask. So the main point is that DeepSeek Sparse Attention does not hard-code the sparsity pattern. It learns which past tokens to keep.
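Putting the two stages together, here is a minimal, self-contained PyTorch sketch of the indexer-plus-selector idea. The class name LightningIndexerSelector, the projection size d_index, the default top_k, and scoring the raw hidden states x are assumptions made for illustration; DeepSeek V3.2's actual indexer works on MLA's compressed representations and its scoring details differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightningIndexerSelector(nn.Module):
    """Two-stage sketch: a small learned indexer scores past positions, a top-k
    selector turns those scores into a sparse boolean mask, and ordinary
    scaled-dot-product attention runs under that mask. Names and sizes are
    illustrative assumptions, not the DeepSeek V3.2 implementation."""

    def __init__(self, d_model: int, d_index: int = 64, top_k: int = 128):
        super().__init__()
        self.q_index = nn.Linear(d_model, d_index, bias=False)  # indexer query projection
        self.k_index = nn.Linear(d_model, d_index, bias=False)  # indexer key projection
        self.top_k = top_k

    def forward(self, x: torch.Tensor, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        # x: (batch, seq, d_model) hidden states fed to the lightning indexer
        # q, k, v: (batch, heads, seq, head_dim) projections for the main attention
        b, _, s, _ = q.shape

        # Stage 1: lightning indexer -- cheap relevance score for every (query, past) pair
        scores = self.q_index(x) @ self.k_index(x).transpose(1, 2)           # (b, s, s)
        causal = torch.tril(torch.ones(s, s, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~causal, float("-inf"))                  # never score future tokens

        # Stage 2: token selector -- keep only the top-k highest-scoring past positions per query
        n_keep = min(self.top_k, s)
        idx = scores.topk(n_keep, dim=-1).indices                            # (b, s, n_keep)
        keep = torch.zeros(b, s, s, dtype=torch.bool, device=x.device)
        keep.scatter_(-1, idx, torch.ones_like(idx, dtype=torch.bool))
        keep &= causal                                                       # drop spurious picks on short prefixes

        # Main attention attends only to the selected positions
        return F.scaled_dot_product_attention(q, k, v, attn_mask=keep.unsqueeze(1))
```

A production implementation would also handle KV caching and incremental decoding; the sketch above keeps everything in a single prefill-style forward pass for readability.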
DeepSeek Sparse Attention is relatively new and relatively complicated to implement, which is why it has not yet been adopted as widely as Grouped-Query Attention (GQA).
Sources