The 30 Principles for Agentic Engineering — Part 5: Calibration and Reality
$ grep -n "^##" 2026-05-thirty-principles-agentic-engineering-part-5-calibration.md
- 8:Principle 26 — PR-noise budget
- 16:Principle 27 — Independent verification, not AI-reviews-AI
- 24:Principle 28 — Maintenance horror discipline (model-swap regression test)
- 32:Principle 29 — Watch the 15-tool-calls-per-prompt degradation point
- 42:Principle 30 — Protect junior engineer development
- 54:The thirty principles in one line each
These five principles were the last to be added to the framework — they came out of contrarian-voice research and practitioner reports rather than the case-study mainline. They're also the ones most teams skip, because they feel like hedges on a story that's otherwise optimistic.
They're not hedges. They're the layer that catches what everything else misses.
Part 5 of 5 — the final part of the 30-principle reference.
Principle 26 — PR-noise budget
Agent-built PRs ship more incidents than human-built PRs. Faros reports 242.7% more. Set a PR-noise budget — a maximum acceptable post-merge defect rate per PR — and treat exceeding it as a signal to pause Stage 4 rollout, not to run faster.
The mechanism: throughput improves long before quality does. The published telemetry from GitClear, Faros, and METR is consistent — agentic teams ship more PRs, faster, with materially higher post-merge defect rates. The teams that succeed measure both. The teams that don't are surprised by the incident curve at month three, when the throughput story they told leadership collides with a support queue that doubled.
Measure your pre-adoption defect rate per PR (90-day baseline). Set the budget at ≤2× that number. If exceeded for four consecutive weeks, pause Stage 4 rollout and debug the harness — verifier loop, characterisation tests, review discipline — before adding more agents. Publish the metric monthly to leadership alongside throughput. The throughput number without the defect rate is a half-truth.
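The budget rule above can be sketched as a small check. This is a minimal illustration, not a prescribed implementation: `WeekStats`, `defect_rate`, and `should_pause_rollout` are hypothetical names, and it assumes you can already attribute post-merge defects to the week's merged PRs.

```python
from dataclasses import dataclass


@dataclass
class WeekStats:
    merged_prs: int          # PRs merged during the week
    post_merge_defects: int  # defects later traced back to those PRs


def defect_rate(merged_prs: int, post_merge_defects: int) -> float:
    """Post-merge defects per merged PR; 0.0 when nothing was merged."""
    return post_merge_defects / merged_prs if merged_prs else 0.0


def should_pause_rollout(
    baseline_rate: float,
    weeks: list[WeekStats],
    multiplier: float = 2.0,     # budget is 2x the pre-adoption baseline
    consecutive_weeks: int = 4,  # pause after four weeks over budget in a row
) -> bool:
    """True when the most recent weeks form an over-budget streak."""
    budget = baseline_rate * multiplier
    streak = 0
    for week in weeks:  # weeks are expected in chronological order
        rate = defect_rate(week.merged_prs, week.post_merge_defects)
        # A week at or under budget resets the streak.
        streak = streak + 1 if rate > budget else 0
    return streak >= consecutive_weeks


# Example: 20 defects across 400 PRs in the 90-day baseline window.
baseline = defect_rate(400, 20)
recent = [WeekStats(50, 6)] * 4  # 0.12 defects per PR, four weeks running
print(should_pause_rollout(baseline, recent))
```

A week that lands exactly on the budget counts as within it, matching the "≤2×" wording.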
Principle 27 — Independent verification, not AI-reviews-AI
Verification must be independent of the agent that generated the work. AI-reviews-AI is not independent. Use deterministic checks (typecheck, lint, test, security scan) or a human. The same model cannot verify its own output.
"Agent gaslighting" is the most consistently upvoted failure mode on Reddit's agentic engineering threads: an agent claims success on a task it actually failed, and a reviewer-agent rubber-stamps it because they share priors. Anthropic's own capability-concealment research shows models can hide capability under evaluation. The segregation-of-duties expectation in the Monetary Authority of Singapore's Technology Risk Management (MAS TRM) Guidelines requires that independence. The dedicated regulatory post carries the audit-grade citations.
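Make verify.sh deterministic: typecheck, lint, test, audit. No LLM in the verify loop. If you also run a verifier subagent as a screening step, put it on a different model from the generator (Sonnet generates, Opus verifies, or vice versa), and remember that two models from the same vendor still share priors, so it screens rather than controls. Humans merge. Add to CLAUDE.md: "Tests are truth. Your description of success is hypothesis."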
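As a rough sketch of what a deterministic verify step looks like, here is the same idea in Python. The tool choices (`mypy`, `ruff`, `pytest`, `pip-audit`) are assumptions for a Python repo, not something the framework prescribes, and `run_checks` and `exit_code` are hypothetical names. Pass or fail comes only from exit codes, and no model is consulted.

```python
import subprocess
import sys

# Hypothetical check commands; substitute the tools your repo actually uses.
CHECKS: dict[str, list[str]] = {
    "typecheck": ["mypy", "."],
    "lint": ["ruff", "check", "."],
    "test": ["pytest", "-q"],
    "audit": ["pip-audit"],
}


def run_checks(checks: dict[str, list[str]]) -> dict[str, bool]:
    """Run every check; pass/fail comes only from the process exit code."""
    results: dict[str, bool] = {}
    for name, command in checks.items():
        try:
            completed = subprocess.run(command, capture_output=True)
            results[name] = completed.returncode == 0
        except OSError:
            # A tool that cannot be started counts as a failure, not a pass.
            results[name] = False
        if not results[name]:
            print(f"FAIL: {name}", file=sys.stderr)
    return results


def exit_code(results: dict[str, bool]) -> int:
    """0 only when every check passed, so a hook can gate on it."""
    return 0 if all(results.values()) else 1
```

A script entry point would call `sys.exit(exit_code(run_checks(CHECKS)))`. Treating a missing tool as a failure is deliberate: a check that silently did not run should never read as a pass.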
Principle 28 — Maintenance horror discipline (model-swap regression test)
When the underlying model changes (vendor upgrade, pin update, fallback activated), agent behaviour can drift silently even if the prompt is identical. Run a regression test suite on every model change.
Pinning (Principle 19) catches unintentional swaps. It doesn't help with intentional upgrades, which are the more interesting failure mode: yesterday's skill works subtly differently today because the new model interprets the same prompt with subtly different defaults. You can't see the change. You can only see its consequences, weeks later, when a ticket pattern you thought was solved starts failing again.
Maintain a small canonical-prompt eval suite — ten to twenty prompts that exercise your core skills with known-good outputs. Run it before adopting any new model version. Diff the outputs. Material divergence means investigate before adopting. The suite lives in evals/, versioned alongside the marketplace.
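The "diff the outputs" step might look something like this. It is a sketch under stated assumptions: outputs for each canonical prompt have already been collected into dictionaries keyed by a prompt id, text similarity stands in for "material divergence", and the 0.9 threshold is an arbitrary placeholder to tune. `similarity` and `diverging_prompts` are hypothetical names.

```python
import difflib


def similarity(expected: str, actual: str) -> float:
    """Ratio in [0, 1]; 1.0 means the two outputs are identical."""
    return difflib.SequenceMatcher(None, expected, actual).ratio()


def diverging_prompts(
    known_good: dict[str, str],
    candidate: dict[str, str],
    threshold: float = 0.9,  # hypothetical cut-off for "material" divergence
) -> dict[str, list[str]]:
    """Map each diverging prompt id to a unified diff of its two outputs."""
    report: dict[str, list[str]] = {}
    for prompt_id, expected in known_good.items():
        # A prompt the candidate never answered is compared against "".
        actual = candidate.get(prompt_id, "")
        if similarity(expected, actual) < threshold:
            report[prompt_id] = list(
                difflib.unified_diff(
                    expected.splitlines(),
                    actual.splitlines(),
                    fromfile=f"{prompt_id} (known-good)",
                    tofile=f"{prompt_id} (candidate)",
                    lineterm="",
                )
            )
    return report
```

Character-level similarity is a crude proxy, since model output varies from run to run even without a model change. For skills that produce code, running the generated code against tests is likely a stronger signal than comparing text.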
Principle 29 — Watch the 15-tool-calls-per-prompt degradation point
Practitioner consensus puts the cliff at roughly 15 tool calls per prompt. Anthropic's own multi-agent Research deployment caps subagents at 3–10 (simple) or 10–15 (synthesis) tool calls. Split the work, summarise context, or hand off to a subagent before you hit the cliff — not after.
The mechanism is context accumulation, attention drift, and failure compounding hitting at roughly the same time. Past the cliff, the agent doesn't fail dramatically — it starts taking slightly wrong turns, ignoring constraints it was honouring earlier, producing output that looks plausible but isn't quite right. That's the worst kind of failure because it passes the eyeball test. The dedicated post covers the discipline in detail.
This one I learned by measuring myself. When I first put a counter on my own sessions, a routine bug-fix was already past fifteen calls by the second iteration — I'd been operating on the wrong side of the cliff for months without noticing, precisely because nothing failed loudly. It just got slightly worse, slightly more often.
Instrument tool-call count per session via OTEL. At 50% context (folk wisdom from the Reddit threads, borne out by the practitioner data), /compact proactively — don't wait for overflow. Decompose to a subagent or close the session and start fresh with a refined context before you hit 15. The cost of a clean handoff is always lower than the cost of debugging a context-blowout failure.
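A minimal sketch of the decision logic behind that instrumentation, kept independent of any telemetry library. `SessionBudget` is a hypothetical name, the three-call handoff margin is an assumption added so the warning fires before the cliff rather than at it, and in practice the counter values would come from your OTEL pipeline rather than manual calls.

```python
from dataclasses import dataclass


@dataclass
class SessionBudget:
    max_tool_calls: int = 15         # the cliff described above
    handoff_margin: int = 3          # assumed: act this many calls early
    compact_at_context: float = 0.5  # fraction of the context window in use
    tool_calls: int = 0

    def record_tool_call(self) -> None:
        self.tool_calls += 1

    def advice(self, context_used: float) -> str:
        """Return 'handoff', 'compact', or 'continue' for the current state."""
        if self.tool_calls >= self.max_tool_calls - self.handoff_margin:
            return "handoff"  # decompose to a subagent or start a fresh session
        if context_used >= self.compact_at_context:
            return "compact"  # run /compact before the window overflows
        return "continue"


budget = SessionBudget()
for _ in range(12):
    budget.record_tool_call()
print(budget.advice(context_used=0.3))
```

Handoff takes priority over compaction here because a session close to the call limit gains little from compacting first.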
Principle 30 — Protect junior engineer development
Junior engineers using agents experience measurable skill atrophy. Senior engineers report roughly 40% boilerplate speedup with rigorous review. The same tool produces different outcomes for different people — and the difference compounds over years.
This is the most under-discussed principle in the set, and the one with the longest-tail consequences. The collapse in Stack Overflow question volume, from roughly 200,000 monthly questions in the pre-LLM era to under 4,000, is the leading indicator of community knowledge erosion. When juniors stop hitting problems they have to solve by thinking, the knowledge base that seniors relied on to get to senior doesn't accumulate. Cognitive debt is real (Simon Willison, Cal Newport). The dedicated treatment covers the evidence.
The bottleneck in agentic work has shifted from "can the AI do it?" to "can a human validate it before acting?" That question gets harder to answer well if the people asking it never learned how to do the work themselves.
For juniors: paired sessions with seniors, where the junior writes the prompt, the senior reviews the diff, and both reflect on what was missed. Rotate between agent-assisted and hand-coded weeks to maintain fundamental skills. For seniors: focus agent use on review and exploration, not generation. Senior value is judgement, not boilerplate speed.
Budget the PR noise, keep the verifier independent, regression-test on every model change, watch the 15-call cliff, and remember the tool produces different outcomes for different people.
The thirty principles in one line each
A reference card — pin it somewhere visible.
| # | Principle | One-line summary |
|---|---|---|
| 1 | Standardise the harness | Five-layer harness, don't reinvent |
| 2 | Verification is load-bearing | Stop hook running verify.sh |
| 3 | Plan mode default | Plans before code for 3+ steps |
| 4 | Cheapest layer | Hook > skill > subagent > plugin |
| 5 | Reflect after every task | tasks/lessons.md |
| 6 | Ticket is the contract | AC mandatory; never free-interpret |
| 7 | Intake distillation + curation | AI distills, human refines |
| 8 | Humans gate three places | Intake, irreversible, merge |
| 9 | Verify before done | Tests are truth |
| 10 | AI-reviews-AI is not SoD | Screening tool, not control |
| 11 | Characterisation tests first | Capture current behaviour, then refactor |
| 12 | Plan 1.2–1.5× net | Not 2×, not 10× |
| 13 | Plan for the J-curve | Don't pivot at week 6 |
| 14 | OTEL or flying blind | Telemetry is non-negotiable |
| 15 | CLAUDE.md <200 lines | Index, not encyclopedia |
| 16 | Hooks for real incidents | Determinism only where it matters |
| 17 | Skills auto-invoke | Description is the activation phrase |
| 18 | Subagent isolation | No recursion |
| 19 | Pin everything | CLI, model, skills, MCP |
| 20 | Stage 5 is the multiplier | Distribute or stay stuck |
| 21 | strictKnownMarketplaces | Public marketplaces are dirty |
| 22 | No goal-conflict prompts | 39% → 1.2% with escalation path |
| 23 | Quarterly AppSec re-review | Skills age; CVEs land |
| 24 | Four telemetry signals | Cost, tokens, calls, region |
| 25 | One incident a month → runbook | Practise before you need it |
| 26 | PR-noise budget | ≤2× pre-adoption defect rate |
| 27 | Independent verification | Not AI-reviews-AI |
| 28 | Model-swap regression test | Behaviour drifts silently on upgrade |
| 29 | 15-tool-call cliff | Compact and decompose before it |
| 30 | Protect the juniors | Same tool, different outcomes |
Apply five at a time, two batches a month. Most teams should target principles 1–10 in month 1, 11–20 in month 2, 21–30 in month 3. By end of quarter you're at Stage 3 of the maturity model and Stage 4 is within reach.
The point of a reference document isn't that you'll remember all thirty. It's that when something is going wrong, you have somewhere to look to find which principle you've been quietly skipping.
Series Navigation — The 30 Principles for Agentic Engineering
- Part 1: The Kernel
- Part 2: The Lifecycle
- Part 3: The Harness
- Part 4: Governance and Safety
- Part 5: Calibration and Reality (you are here)