The 30 Principles for Agentic Engineering — Part 5: Calibration and Reality
Final part of the 30-principle reference. The kernel sets the spine, the lifecycle moves work through it, the harness configures the layers, governance keeps it defensible. This part is the calibration — the reality-check layer that catches what the rest of the framework would otherwise miss.
These five principles were the last to be added — they came out of contrarian-voice research and the Reddit practitioner reports rather than the case-study mainline. They are also the ones most teams skip. Don't.
Principle 26 — PR-noise budget
Statement. Agent-built PRs ship more incidents than human-built PRs (Faros reports 242.7% more). Set a PR-noise budget: the maximum post-merge defect rate per PR you will accept before pausing Stage 4 rollout.
Why it matters. Throughput improves long before quality does. The published telemetry — GitClear, Faros, METR — is consistent: agentic teams ship more PRs, faster, with materially higher post-merge defect rates. The teams that succeed measure both. The teams that don't are surprised by the incident curve at month three. The budget is the mechanism that turns the surprise into a known threshold.
Tomorrow morning.
- Measure your pre-adoption defect rate per PR (90-day baseline).
- Set the budget: ≤2× pre-adoption defect rate.
- If exceeded for four consecutive weeks, pause Stage 4 rollout. Debug the harness (verifier loop, characterisation tests, review discipline) before adding more agents.
- Publish the metric monthly to leadership alongside throughput.
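The budget rule above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `WeekStats` type, the `should_pause_rollout` function, and the sample numbers are all hypothetical, and it assumes you already count merged PRs and post-merge defects per week.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeekStats:
    """One week of merge data (hypothetical shape, for illustration)."""
    merged_prs: int
    post_merge_defects: int

    @property
    def defect_rate(self) -> float:
        # Defects per merged PR; a week with no merges counts as 0.0.
        if self.merged_prs == 0:
            return 0.0
        return self.post_merge_defects / self.merged_prs


def should_pause_rollout(
    baseline_rate: float,
    weeks: list[WeekStats],
    multiplier: float = 2.0,      # budget: at most 2x the pre-adoption rate
    consecutive_weeks: int = 4,   # pause after four consecutive weeks over budget
) -> bool:
    """Return True if each of the most recent weeks exceeded the budget."""
    budget = baseline_rate * multiplier
    recent = weeks[-consecutive_weeks:]
    if len(recent) < consecutive_weeks:
        return False  # not enough history to call it yet
    return all(week.defect_rate > budget for week in recent)


# Example: a 90-day baseline of 0.05 defects per PR gives a budget of 0.10.
history = [
    WeekStats(merged_prs=40, post_merge_defects=3),
    WeekStats(merged_prs=50, post_merge_defects=6),
    WeekStats(merged_prs=55, post_merge_defects=7),
    WeekStats(merged_prs=60, post_merge_defects=8),
    WeekStats(merged_prs=58, post_merge_defects=7),
]
pause = should_pause_rollout(baseline_rate=0.05, weeks=history)
```

The check looks only at the trailing window, so a single good week resets the count, which matches the "four consecutive weeks" wording.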
Principle 27 — Independent verification, not AI-reviews-AI
Statement. Verification must be independent of the agent that generated the work. AI-reviews-AI is not independent. Use deterministic checks (typecheck, lint, test, security scan) or a human. Do not use the same model to verify its own output.
Why it matters. "Agent gaslighting" is the most consistently upvoted failure mode on Reddit's agentic engineering threads: an agent claims success on a task it actually failed, and a reviewer-agent rubber-stamps the claim because the two share priors. Anthropic's own capability-concealment research shows models can hide capability under evaluation. The segregation-of-duties requirement in MAS TRM (the Monetary Authority of Singapore's Technology Risk Management guidelines) demands independence. The dedicated regulatory post carries the audit-grade citations.
Tomorrow morning.
- Make `verify.sh` deterministic: typecheck, lint, test, audit. No LLM in the verify loop.
- If you use a verifier subagent, use a different model from the generator (Sonnet generates, Opus verifies, or vice versa).
- Humans merge. No exception for sensitive paths.
- Add to `CLAUDE.md`: "Tests are truth. Your description of success is hypothesis."
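A deterministic verify step might look like the sketch below. It is written in Python rather than as the `verify.sh` shell script the post refers to, and the commands in `CHECKS` (`mypy`, `ruff`, `pytest`, `pip-audit`) are assumed examples for a Python repository, not the author's toolchain. The only point it illustrates is that pass or fail comes from exit codes, with no model in the loop.

```python
import subprocess
import sys
from typing import Mapping, Sequence

# Assumed example commands; substitute whatever your repository actually uses.
CHECKS: dict[str, list[str]] = {
    "typecheck": ["mypy", "."],
    "lint": ["ruff", "check", "."],
    "test": ["pytest", "-q"],
    "audit": ["pip-audit"],
}


def run_checks(checks: Mapping[str, Sequence[str]]) -> dict[str, bool]:
    """Run each command; a check passes only if its exit code is 0."""
    results: dict[str, bool] = {}
    for name, cmd in checks.items():
        try:
            completed = subprocess.run(list(cmd), capture_output=True, text=True)
            results[name] = completed.returncode == 0
        except OSError:
            # A missing or unrunnable tool is a failure, never a silent skip.
            results[name] = False
    return results


def verify(checks: Mapping[str, Sequence[str]]) -> int:
    """Print a PASS/FAIL line per check and return a process exit code."""
    results = run_checks(checks)
    for name, ok in results.items():
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    return 0 if all(results.values()) else 1
```

Wired into a hook, the entry point would be something like `sys.exit(verify(CHECKS))`, so that any failing check blocks the agent from declaring the task done.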
Principle 28 — Maintenance horror discipline (model-swap regression test)
Statement. When the underlying model changes — vendor upgrade, model pin update, fallback activated — agent behaviour silently degrades even if the prompt is identical. Run a regression test suite on every model change.
Why it matters. Pinning (Principle 19) catches unintentional model swaps. It doesn't help with intentional upgrades, which are the more interesting failure mode: yesterday's skill works subtly differently today because the new model interprets the same prompt with subtly different defaults. The fix is a small canonical-prompt eval suite — ten to twenty prompts that exercise your core skills with known-good outputs. Run it before adopting any new model. Diff the outputs.
Tomorrow morning.
- Maintain a small suite of canonical prompts and expected outputs.
- Run the suite before adopting any new model version.
- Diff outputs. Material divergence means investigate before adopting.
- The suite lives in `evals/`, versioned alongside the marketplace.
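The "diff the outputs" step can be sketched with the standard library alone. Everything here is an assumption for illustration: the `diff_outputs` name, the prompt IDs, and the 0.9 similarity threshold are hypothetical, and character-level similarity is only a crude proxy for "material divergence". How the candidate model is actually called is left out; the function takes the outputs as plain strings.

```python
import difflib


def diff_outputs(
    expected: dict[str, str],
    actual: dict[str, str],
    min_similarity: float = 0.9,  # assumed threshold; tune per suite
) -> dict[str, str]:
    """Return a unified diff for every prompt whose new output diverges.

    `expected` maps prompt IDs to known-good outputs from the current model;
    `actual` maps the same IDs to outputs from the candidate model.
    """
    divergent: dict[str, str] = {}
    for prompt_id, want in expected.items():
        got = actual.get(prompt_id, "")  # a missing output counts as divergence
        ratio = difflib.SequenceMatcher(None, want, got).ratio()
        if ratio < min_similarity:
            diff_lines = difflib.unified_diff(
                want.splitlines(),
                got.splitlines(),
                fromfile=f"{prompt_id}/expected",
                tofile=f"{prompt_id}/actual",
                lineterm="",
            )
            divergent[prompt_id] = "\n".join(diff_lines)
    return divergent


# Example with two hypothetical canonical prompts.
known_good = {
    "add-function": "def add(a, b):\n    return a + b",
    "greeting": "hello world",
}
candidate = {
    "add-function": "def add(a, b):\n    return a + b",
    "greeting": "completely different text here",
}
report = diff_outputs(known_good, candidate)
```

An empty report means the candidate model reproduced the suite within the threshold; any entry is a prompt to investigate before adopting.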
Principle 29 — Watch the 15-tool-calls-per-prompt degradation point
Statement. Practitioner consensus puts the cliff at roughly 15 tool calls per prompt. Anthropic's own multi-agent Research deployment caps subagents at 3–10 (simple) or 10–15 (synthesis) tool calls. Split the work, summarise context, or hand off to a subagent before you hit the cliff.
Why it matters. The mechanism is context accumulation, attention drift, and failure compounding hitting at roughly the same time. The short-form post covers the discipline in detail. The operational rule: instrument the tool-call count, /compact proactively at 50% context, decompose to subagents before the threshold rather than after.
Tomorrow morning.
- Instrument tool-call count per session via OTEL.
- When approaching 15, `/compact` proactively (Reddit folk wisdom: at 50% context, not at overflow).
- Decompose: pass partial work to a subagent or close the session and start fresh with a refined context.
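The counting side of this rule is simple enough to sketch. The `ToolCallBudget` class below is hypothetical, the soft limit of 12 is an assumed warning point rather than a figure from the post, and the OTEL export is deliberately left out: in a real setup the count would be emitted as a metric instead of held in memory.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToolCallBudget:
    """Counts tool calls in one session and suggests when to act."""
    soft_limit: int = 12  # assumed warning point, chosen for illustration
    hard_limit: int = 15  # the practitioner-consensus cliff
    calls: int = 0
    by_tool: Counter = field(default_factory=Counter)

    def record(self, tool_name: str) -> str:
        """Record one tool call and return the suggested next action."""
        self.calls += 1
        self.by_tool[tool_name] += 1
        if self.calls >= self.hard_limit:
            return "decompose"  # hand off to a subagent or start a fresh session
        if self.calls >= self.soft_limit:
            return "compact"    # summarise context before reaching the cliff
        return "continue"


# Example: fifteen calls in a single session.
budget = ToolCallBudget()
actions = [budget.record("read_file") for _ in range(15)]
```

Keeping a per-tool breakdown in `by_tool` makes it easier to see which tool is driving a session toward the threshold.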
Principle 30 — Protect junior engineer development
Statement. Junior engineers using agents experience measurable skill atrophy. Senior engineers report roughly 40% boilerplate speedup with rigorous review. The same tool produces different outcomes per audience.
Why it matters. This is the most under-discussed principle in the set, and the one with the longest-tail consequences. The collapse in Stack Overflow question volume, from roughly 200,000 monthly questions in the pre-LLM era to under 4,000, is the leading indicator of community knowledge erosion. Cognitive debt is a real and named phenomenon (Willison, Cal Newport). The dedicated treatment covers the evidence base.
Tomorrow morning.
- For juniors: paired sessions with seniors — junior writes the prompt, senior reviews the diff, both reflect on what was missed.
- For juniors: rotate between agent-assisted and hand-coded weeks to maintain fundamental skills.
- For seniors: focus agent use on review and exploration, not generation. Senior value is judgement, not boilerplate speed.
The calibration layer in one line
Budget the PR noise, keep the verifier independent, regression-test on every model change, watch the 15-call cliff, and remember the tool produces different outcomes for different people.
The thirty principles in one line each
A reference card you can print and pin somewhere visible.
| # | Principle | One-line summary |
|---|---|---|
| 1 | Standardise the harness | Five-layer harness, don't reinvent |
| 2 | Verification is load-bearing | Stop hook running verify.sh |
| 3 | Plan mode default | Plans before code for 3+ steps |
| 4 | Cheapest layer | Hook > skill > subagent > plugin |
| 5 | Reflect after every task | tasks/lessons.md |
| 6 | Ticket is the contract | AC mandatory; never free-interpret |
| 7 | Intake distillation + curation | AI distills, human refines |
| 8 | Humans gate three places | Intake, irreversible, merge |
| 9 | Verify before done | Tests are truth |
| 10 | AI-reviews-AI is not SoD | Screening tool, not control |
| 11 | Characterisation tests first | Capture current behaviour, then refactor |
| 12 | Plan 1.2–1.5× net | Not 2×, not 10× |
| 13 | Plan for the J-curve | Don't pivot at week 6 |
| 14 | OTEL or flying blind | Telemetry is non-negotiable |
| 15 | CLAUDE.md <200 lines | Index, not encyclopedia |
| 16 | Hooks for real incidents | Determinism only where it matters |
| 17 | Skills auto-invoke | Description is the activation phrase |
| 18 | Subagent isolation | No recursion |
| 19 | Pin everything | CLI, model, skills, MCP |
| 20 | Stage 5 is the multiplier | Distribute or stay stuck |
| 21 | strictKnownMarketplaces | Public marketplaces are dirty |
| 22 | No goal-conflict prompts | 39% → 1.2% with escalation path |
| 23 | Quarterly AppSec re-review | Skills age; CVEs land |
| 24 | Four telemetry signals | Cost, tokens, calls, region |
| 25 | One incident a month → runbook | Practise before you need it |
| 26 | PR-noise budget | ≤2× pre-adoption defect rate |
| 27 | Independent verification | Not AI-reviews-AI |
| 28 | Model-swap regression test | Behaviour drifts silently on upgrade |
| 29 | 15-tool-call cliff | Compact and decompose before it |
| 30 | Protect the juniors | Same tool, different outcomes |
That's the framework. Apply five at a time, two batches a month: most teams should target principles 1–10 in month 1, 11–20 in month 2, 21–30 in month 3. By the end of the quarter you're at Stage 3 of the maturity model and Stage 4 is within reach.
The point of a reference document is not that you'll remember it all. It's that when something is going wrong, you have somewhere to look to find which principle you've been quietly skipping.
Series Navigation — The 30 Principles for Agentic Engineering
- Part 1: The Kernel
- Part 2: The Lifecycle
- Part 3: The Harness
- Part 4: Governance and Safety
- Part 5: Calibration and Reality (you are here)