GLM-5.2 and IndexShare for Long-Context Sparse Attention
GLM-5.2 is a recent open-weight model release from Z.ai. My first impression is that it is the strongest open-weight model available at the time of writing. As with any fresh release, I would treat the release-time leaderboard position as date-sensitive.
Architecturally, it builds on the earlier GLM-5 and GLM-5.1. In particular, it reuses Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention (DSA), the mechanism from DeepSeek V3.2 that I covered in the DeepSeek V3 to V3.2 article.
What’s new is IndexShare. This is a cross-layer reuse trick for DSA. Instead of recomputing the sparse-attention top-k indexer in every layer, GLM-5.2 runs the full indexer only once every four layers. The following layers then reuse the selected token indices.
This keeps the same DSA idea but makes 1M-token inference cheaper. The attention pattern is still adaptive, but the model spends less compute deciding which earlier tokens to attend to.
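To make the reuse pattern concrete, here is a minimal NumPy sketch of the idea, not GLM-5.2's actual implementation. The function names, the toy single-head attention, the dot-product indexer, and the choice to refresh on layers 0, 4, 8, ... are all my assumptions for illustration. The only detail taken from the description above is that the full indexer runs once every four layers and the layers in between reuse its selected token indices.

```python
import numpy as np

SHARE_EVERY = 4  # from the note: the full indexer runs once every four layers


def topk_indices(hidden, w_q, w_k, k):
    """Toy indexer (hypothetical): score causal (query, key) pairs, keep the k best keys per query."""
    T = hidden.shape[0]
    scores = (hidden @ w_q) @ (hidden @ w_k).T            # (T, T) indexer scores
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)            # future tokens can never win
    return np.argsort(-scores, axis=-1)[:, :k]            # (T, k), best keys first


def sparse_attention(hidden, idx):
    """Toy single-head attention restricted to the selected keys (no learned projections)."""
    T, d = hidden.shape
    keys = hidden[idx]                                    # (T, k, d) gathered keys/values
    logits = np.einsum("td,tkd->tk", hidden, keys) / np.sqrt(d)
    # Early queries have fewer than k causal keys, so mask any future index that slipped in.
    valid = idx <= np.arange(T)[:, None]
    logits = np.where(valid, logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.einsum("tk,tkd->td", weights, keys)


def forward(hidden, indexer_params, k):
    """Run the layer stack, refreshing the index set only on every SHARE_EVERY-th layer."""
    idx, indexer_calls = None, 0
    for layer, (w_q, w_k) in enumerate(indexer_params):
        if layer % SHARE_EVERY == 0:                      # assumed placement of the refresh
            idx = topk_indices(hidden, w_q, w_k, k)
            indexer_calls += 1
        hidden = hidden + sparse_attention(hidden, idx)   # shared idx on the other layers
    return hidden, indexer_calls


rng = np.random.default_rng(0)
T, d, d_idx, n_layers, k = 16, 8, 4, 8, 4
params = [(0.1 * rng.normal(size=(d, d_idx)), 0.1 * rng.normal(size=(d, d_idx)))
          for _ in range(n_layers)]
x = rng.normal(size=(T, d))
out, calls = forward(x, params, k)
print(f"{n_layers} layers, {calls} indexer calls")
```

In this toy setup, the indexer's all-pairs scoring is the only step that still scales with the square of the sequence length, so running it on a quarter of the layers removes most of that overhead while the attention itself stays at k keys per query. For simplicity every layer carries indexer weights here even though only every fourth layer uses them.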
The local GLM-5.2 architecture card has the current summary, config links, and benchmark references.
Source: lightly edited website version of my Substack note.
