I added a from-scratch implementation of DeepSeek Sparse Attention to the LLMs-from-scratch repository, thanks to an excellent reader contribution.

The folder includes a README, a standalone GPT-style reference implementation, and tests:

  1. README.md
  2. gpt_with_kv_dsa.py
  3. test_dsa.py

The main idea behind DeepSeek Sparse Attention is to replace a fixed sparse pattern with a learned one. Instead of attending only within a local window, the mechanism uses a lightweight indexer to score prior tokens and a selector to keep the highest-scoring ones, so each query attends only to the tokens judged worth attending to.
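To make the indexer-and-selector idea concrete, here is a minimal pure-Python sketch for a single query token. It is not the code from gpt_with_kv_dsa.py. The function names (`indexer_scores`, `select_top_k`, `sparse_attention`) are hypothetical, the indexer is reduced to a single-head ReLU dot product as a simplifying assumption, and the toy numbers are made up for illustration.

```python
import math


def indexer_scores(q_idx, k_idx_list):
    """Cheap relevance score per token: ReLU(q_idx . k_idx).

    Assumption: a single-head stand-in for the lightweight indexer.
    """
    return [max(0.0, sum(a * b for a, b in zip(q_idx, k))) for k in k_idx_list]


def select_top_k(scores, query_pos, k):
    """Keep the k best-scoring positions at or before query_pos (causal)."""
    candidates = range(query_pos + 1)
    ranked = sorted(candidates, key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])  # restore positional order


def sparse_attention(query, keys, values, scores, query_pos, k):
    """Scaled dot-product attention restricted to the selected positions."""
    selected = select_top_k(scores, query_pos, k)
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * x for q, x in zip(query, keys[j])) for j in selected]

    # Numerically stable softmax over the selected tokens only.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(values[0])
    out = [sum(w * values[j][d] for w, j in zip(weights, selected)) for d in range(dim)]
    return selected, weights, out


# Toy example: 5 tokens, the query sits at the last position, keep 2 tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
query = [1.0, 0.0]
scores = [0.1, 0.9, 0.3, 0.7, 0.5]  # made-up indexer output for this query

selected, weights, out = sparse_attention(query, keys, values, scores, query_pos=4, k=2)
print(selected, weights, out)
```

In this toy example the query attends to positions 1 and 3, even though neither is its nearest neighbor. That is the contrast with a sliding window, which would always keep the most recent positions regardless of content.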

For more background, I also have a local DeepSeek Sparse Attention concept page and a gallery explainer that compare it with regular causal attention and sliding-window attention.

DeepSeek Sparse Attention implementation overview

Screenshot from the original Substack note, showing the DeepSeek Sparse Attention implementation folder and README overview.

Source: lightly edited website version of my Substack note.