DeepSeek Sparse Attention From Scratch
I added a from-scratch implementation of DeepSeek Sparse Attention to the LLMs-from-scratch repository, thanks to an excellent reader contribution.
The folder includes a README, a standalone GPT-style reference implementation, and tests.
The main idea behind DeepSeek Sparse Attention is to replace a fixed sparse pattern with a learned sparse pattern. Instead of using only a local window, the mechanism uses a lightweight indexer and selector to decide which prior tokens are worth attending to.
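To make the indexer-and-selector idea concrete, here is a minimal sketch in plain Python. It is not the code from the repository. It assumes a single attention head, random untrained weights, and a simplified indexer that scores each prior token with one ReLU-gated dot product in a small index space. All function names, weight names, and dimensions (`sparse_causal_attention`, `W_iq`, `W_ik`, `d_index`, `top_k`) are hypothetical and chosen only for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(W, x):
    # Multiply a weight matrix (list of rows) by a vector.
    return [dot(row, x) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_k(index_scores, k):
    # Selector: keep the k highest-scoring positions, returned in ascending order.
    ranked = sorted(range(len(index_scores)), key=lambda s: index_scores[s], reverse=True)
    return sorted(ranked[:k])


def sparse_causal_attention(x, W_q, W_k, W_v, W_iq, W_ik, top_k):
    q = [project(W_q, tok) for tok in x]
    k = [project(W_k, tok) for tok in x]
    v = [project(W_v, tok) for tok in x]
    # Indexer projections live in a much smaller space than q and k (assumption).
    iq = [project(W_iq, tok) for tok in x]
    ik = [project(W_ik, tok) for tok in x]
    scale = math.sqrt(len(q[0]))
    outputs, selections = [], []
    for t in range(len(x)):
        # Indexer: cheap relevance score for every causal position s <= t.
        index_scores = [max(0.0, dot(iq[t], ik[s])) for s in range(t + 1)]
        keep = select_top_k(index_scores, min(top_k, t + 1))
        # Regular scaled dot-product attention, restricted to the selected tokens.
        weights = softmax([dot(q[t], k[s]) / scale for s in keep])
        out = [sum(w * v[s][d] for w, s in zip(weights, keep)) for d in range(len(v[0]))]
        outputs.append(out)
        selections.append(keep)
    return outputs, selections


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, d_head, d_index = 6, 8, 4, 2
x = random_matrix(seq_len, d_model, rng)
W_q, W_k, W_v = (random_matrix(d_head, d_model, rng) for _ in range(3))
W_iq, W_ik = (random_matrix(d_index, d_model, rng) for _ in range(2))
outputs, selections = sparse_causal_attention(x, W_q, W_k, W_v, W_iq, W_ik, top_k=3)
for t, keep in enumerate(selections):
    print(f"token {t} attends to positions {keep}")
```

In this sketch, setting `top_k` to the sequence length or larger selects every causal position, so the result reduces to regular causal attention. A sliding window would instead always keep the most recent positions, regardless of content, whereas here the kept positions depend on the indexer scores.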
For more background, I also have a local DeepSeek Sparse Attention concept page and a gallery explainer that compare it with regular causal attention and sliding-window attention.
Source: lightly edited website version of my Substack note.
