Local Open-Weight LLMs in Coding Harnesses
I've been test-driving various local open-weight LLMs in different coding harnesses (Qwen-Code, Codex, Claude Code).
30B Mixture-of-Experts models are a nice sweet spot: they can solve challenging problems, and they run at roughly 40 tok/sec on a Mac or DGX Spark, which is similar to GPT 5.5 in a Pro subscription and totally usable for everyday work.
Even more interesting is the harness choice! Claude Code seems to use 2x as many tokens as Codex.
Gemma 4 E2B is included just for reference, to show that smaller models can’t trivially solve the tasks.
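To make the throughput and token-usage observations above a bit more concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes generation time is simply tokens divided by decode rate and ignores prompt processing and tool-call latency. The helper `generation_seconds`, the harness labels, and the 20,000-token task size are hypothetical placeholders, not measurements from this note; only the roughly 40 tok/sec rate and the 2x ratio come from the text.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time spent generating tokens at a fixed decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Hypothetical task size (an assumption for illustration, not a measurement).
baseline_tokens = 20_000
# Rough local decode rate quoted in the note.
rate = 40.0  # tok/sec

# A harness that uses 2x the tokens takes 2x the generation time at the same rate.
for harness, multiplier in [("harness A", 1.0), ("harness B (2x tokens)", 2.0)]:
    seconds = generation_seconds(int(baseline_tokens * multiplier), rate)
    print(f"{harness}: {seconds / 60:.1f} min")
```

At a fixed local decode rate, token usage translates directly into waiting time, which is why the harness choice matters as much as the model choice.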
The longer write-up is now available at Using Local Coding Agents.
Source: lightly edited website version of my Substack note.
