Quick Paper and Model Notes
- VibeThinker-3B and the Strength of Post-Training
Short note on VibeThinker-3B, a 3B model based on Qwen2.5-Coder-3B whose reported coding and reasoning results point to strong post-training.
Substack Note
- North Mini Code and Agentic Coding Benchmarks
Short note on North Mini Code, Cohere's open-weight MoE model for agentic coding tasks, with 30B total and 3B active parameters.
Substack Note
- Nemotron 3 Ultra and Latent MoE Scaling
Short note on Nemotron 3 Ultra, NVIDIA's hybrid Mamba-Transformer Latent MoE model with 550B total and 55B active parameters.
Substack Note
- MiniMax M2 and Production-Oriented Model Design
Short note on the MiniMax-M2 technical report, including full attention, fine-grained MoE, agent pipelines, speed rewards, and self-evolution.
Substack Note
- DeepSeek Sparse Attention From Scratch
Short note on a DeepSeek Sparse Attention from-scratch implementation added to the LLMs-from-scratch repository.
Substack Note
- Implementing LLM Architectures From Scratch
Short note linking to a talk on implementing LLM architectures from scratch and on comparing new open-weight model implementations against references.
Substack Note
- Gemma 4 Architecture and Benchmark Notes
Short note on Gemma 4 31B, including its local-global attention recipe, benchmark jump over Gemma 3, and Apache 2.0 release.
Substack Note
- LLM Architecture Gallery Diff Tool
Short note on the LLM Architecture Gallery diff tool for comparing two model architecture stacks side by side.
Substack Note
- Nemotron 3 Super Throughput Notes
Short note on NVIDIA Nemotron 3 Super 120B-A12B, a hybrid Mamba-Transformer MoE model with latent experts and shared-weight MTP.
Substack Note