This week’s picks share a throughline: the shift from “vibe-driven” AI use toward deliberate engineering discipline — guardrailed loops, skills-as-docs, evals as acceptance criteria, and context loaded with intent.

  • AI Updates Weekly — February 16, 2026 (Lev Selector) — “Skills-in-the-middle” reframes app development: move middle-layer logic out of hard-coded decision trees and into auditable Markdown skill files. The catch is governance: skills become a supply-chain problem — enforce least privilege, review what any skill can touch, and anticipate needing a “skills security officer” before long. The tagline says it all: “You are becoming a skills developer, not a software developer.”

  • You are Not Ready: Agentic coding in 2026 — A clear maturity ladder for agentic coding: chat → IDE suggestions → in-the-loop agentic (babysitting, learning failure modes) → out-of-the-loop (spec → delegate → verify) → multi-agent orchestration. The core pitch: you don’t need AGI, you need persistence + verification — a guardrailed loop that runs tools, checks deterministically, and retries. Don’t jump to multi-agent until you can reliably pass a checker; skipping stages just amplifies risk.

  • The senior engineer’s guide to AI coding: Context loading, custom hooks, and automation — Three reliability levers: (1) preload high-signal context (Mermaid diagrams beat rule lists — they give the agent a “flow map” of the system); (2) encode recurring prompts as aliases/CLIs so the workflow is repeatable; (3) add stop hooks that run typecheck/lint after the agent finishes and re-feed failures until it’s actually shippable. “Done” should mean “validated.” And if the conversation drifts: export it, get a second-model critique, reset and restart.

  • DIY dev tools: How this engineer created “Flowy” to visualize his plans and accelerate coding — An “artifacts over vibes” story: plans, flowcharts, and mockups become editable JSON that both the human and the model can read and write back to — a shared intermediate language. Skills become the place you codify what worked so the next run starts smarter, not from scratch. Bonus: a second model in “staff engineer reviewer” mode diffs the implementation against the plan artifact, catching mismatches and smells the primary agent missed.

  • 435: How to Actually Use Claude Code to Build Serious Software — The headline move: make the agent see the running app (browser-in-the-loop via a Chrome bridge) so it stops guessing on UI work. Pair that with strict allow/deny permissions (deny database-wiping commands; watch for scripts that bypass denials) and a completion-promise loop that keeps iterating until checks pass. Most of the value is in “the code it doesn’t generate so you don’t have to throw it away.”

  • Shipping AI That Works: An Evaluation Framework for PMs (Distilled) — “Vibes” don’t scale. Treat evals like tests for non-deterministic systems: collect representative examples, run A/B experiments over a dataset, and use a judge/human calibration loop. The minimum viable eval loop is 10–50 examples, 2–4 discrete labels, and a calibration pass against human raters. The punchline: ship features with evals as acceptance criteria, not a PRD that vanishes after kickoff.

  • Context Engineering: Connecting the Dots with Graphs — Stephen Chin (Neo4j) — Prompts aren’t the whole game. Context engineering is a pipeline: instruction prompts + retrieval + memory + structured outputs. GraphRAG earns its complexity when you need multi-hop answers (people ↔ events ↔ artifacts) and inspectable evidence — you can see which subgraph was retrieved and layer access control before anything hits the model. Ties naturally to Agent Memory Systems.

  • Why maintaining a codebase is so damn hard – with Oh My Zsh creator Robby Russell (Podcast #207) — Rewrites often delete hidden knowledge baked into existing systems by years of incidents; the promise of a rewrite also demotivates tending today’s code. Guardrails added after incidents have a half-life — periodically prune what no longer pays its cycle-time cost. Keep diffs small (AI-assisted or not) so review stays possible. Useful framing: “How long for a typo to reach production?” — that number exposes your real workflow drag.
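
Several of the picks above describe the same guardrailed loop: the agent produces something, a deterministic checker validates it, and failures are fed back until the check passes or the attempt budget runs out. Below is a minimal Python sketch of that idea, not taken from any of the linked pieces. The names `guardrailed_loop`, `CheckResult`, `syntax_check`, and `fake_agent` are hypothetical, the "agent" is a hard-coded stand-in for a model call, and the checker only verifies Python syntax via the built-in `compile`; a real setup would more likely run a typechecker, linter, or test suite.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class CheckResult:
    """Outcome of one deterministic check (hypothetical type)."""
    passed: bool
    feedback: str = ""


def guardrailed_loop(
    attempt: Callable[[str], str],
    check: Callable[[str], CheckResult],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Call `attempt` until `check` passes, re-feeding failure feedback.

    Returns the accepted candidate and the number of attempts used.
    Raises RuntimeError if the budget is exhausted, so "done" can only
    ever mean "validated".
    """
    feedback = ""
    for n in range(1, max_attempts + 1):
        candidate = attempt(feedback)
        result = check(candidate)
        if result.passed:
            return candidate, n
        feedback = result.feedback
    raise RuntimeError(
        f"no passing candidate after {max_attempts} attempts; last feedback: {feedback}"
    )


def syntax_check(source: str) -> CheckResult:
    """Deterministic checker: does the candidate parse as Python?"""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return CheckResult(False, f"line {exc.lineno}: {exc.msg}")
    return CheckResult(True)


def fake_agent(feedback: str) -> str:
    """Stand-in for a model call (assumption): fails once, then fixes itself."""
    if not feedback:
        return "def add(a, b) return a + b"  # missing colon on purpose
    return "def add(a, b):\n    return a + b\n"


if __name__ == "__main__":
    code, attempts = guardrailed_loop(fake_agent, syntax_check)
    print(f"accepted after {attempts} attempt(s)")
```

The point of raising on an exhausted budget rather than returning the last candidate is that an unvalidated result never looks like a finished one.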
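
The "minimum viable eval loop" from the evaluation-framework pick (a small labeled set, a few discrete labels, a calibration pass against human raters) can also be made concrete. The sketch below compares judge labels with human labels on the same examples. Using Cohen's kappa as the calibration metric is an assumption here, not something the source prescribes, and the function names and the four-example dataset are invented for illustration.

```python
from collections import Counter
from typing import Sequence


def agreement(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of examples where the judge label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Agreement corrected for chance (Cohen's kappa) between two raters."""
    n = len(human)
    observed = agreement(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Invented toy data; a real calibration pass would use the 10-50 examples
    # the source suggests.
    human_labels = ["pass", "pass", "fail", "fail"]
    judge_labels = ["pass", "fail", "fail", "fail"]
    print("raw agreement:", agreement(human_labels, judge_labels))
    print("kappa:", cohen_kappa(human_labels, judge_labels))
```

Raw agreement alone can look good when one label dominates the dataset, which is why a chance-corrected number is a reasonable thing to track alongside it before trusting the judge as an acceptance gate.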