CSA and HCA

Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) are the long-context attention mechanisms used in DeepSeek V4. Both reduce the length of the KV cache by compressing groups of previous tokens into fewer KV entries.

This is different from MLA-style per-token latent caching. MLA keeps one compressed latent entry per token. CSA and HCA shorten the sequence of cached entries, so the model has fewer past entries to store and attend over at million-token context lengths.

Architecture gallery Article section DeepSeek V4 report MLA explainer

MLA-style cache, CSA, and HCA compared conceptually — Figure 21. MLA compresses the stored representation for each token, while CSA and HCA compress along the sequence dimension. CSA uses milder sequence compression with sparse top-k selection, and HCA uses heavier sequence compression with dense attention over the shorter cache (Original source *Recent Developments in LLM Architectures*).

What changes

Past tokens are summarized into fewer compressed KV entries

Practical benefit

The long-context cache becomes shorter, reducing memory and attention cost

Example architectures

DeepSeek V4-Pro and DeepSeek V4-Flash

Sequence Compression

The main distinction is the compression axis. MLA mainly compresses the per-token KV representation. It still keeps one latent KV entry for each previous token.

CSA and HCA compress along the sequence dimension. Instead of keeping one entry per previous token, they summarize groups of tokens into fewer entries. This makes the stored history shorter.

In DeepSeek V4, CSA uses a milder compression rate. The figure example shows groups of 4 tokens being compressed into one entry, followed by sparse top-k selection. HCA is more aggressive. It compresses 128 tokens into one compressed KV entry and then attends densely over that shorter cache.

CSA sparse selection and HCA dense attention over compressed history — Figure 22. CSA selects a sparse set of compressed history blocks, while HCA attends densely over more heavily compressed blocks. Both also keep a local sliding-window branch for recent uncompressed KV entries (Original source *Recent Developments in LLM Architectures*).

Why DeepSeek Uses Both

CSA and HCA make different tradeoffs. CSA keeps more compressed context detail, then uses sparse selection to choose which compressed entries to attend to. HCA keeps far fewer entries and can afford dense attention over those entries.

DeepSeek V4 alternates these mechanisms instead of relying on one compression pattern throughout the stack. The CSA layers provide a less aggressive path through compressed history. The HCA layers provide cheaper global coverage over a much shorter cache.

Both paths also retain a local sliding-window branch for recent uncompressed tokens. This keeps detailed access to the newest context while the older context is compressed.

DeepSeek V4 reported 1M-context FLOP and KV-cache savings — Figure 23. Reported 1M-context efficiency numbers from the DeepSeek V4 paper, relative to DeepSeek V3.2. DeepSeek V4-Pro is reported at 27% of the single-token inference FLOPs and 10% of the KV-cache size; DeepSeek V4-Flash is reported at 10% of the FLOPs and 7% of the KV-cache size (Original source *Recent Developments in LLM Architectures*).

How Its Used in DeepSeek V4

DeepSeek V4 combines CSA/HCA with MLA-style compact entries and shared-KV attention. The public gallery summarizes this as MLA-style CSA/HCA because the model still uses compact compressed entries, while the new part is the sequence-length compression.

This attention-side change is separate from mHC. CSA/HCA changes how the model stores and attends over long histories. mHC changes how the residual path carries information around the attention and MoE sublayers.

Tradeoff

CSA/HCA reduces long-context cost by giving up some token-level detail in older history. This is a stronger compression tradeoff than MLA-style per-token latent caching because the cache becomes shorter in addition to narrower.

The DeepSeek V4 paper reports strong overall results, but those results come from the full DeepSeek V4 recipe, including data, optimization, mHC, precision choices, and system-level implementation. I would treat CSA/HCA as a long-context efficiency design rather than a general replacement for MLA.

Sources

Recent Developments in LLM Architectures DeepSeek V4 technical report MLA explainer DeepSeek Sparse Attention explainer A Visual Guide to Attention Variants

Back to architecture gallery