30 Agentic Engineering Principles — Part 5 Calibration

These five principles were the last to be added to the framework — they came out of contrarian-voice research and practitioner reports rather than the case-study mainline. They're also the ones most teams skip, because they feel like hedges on a story that's otherwise optimistic.

They're not hedges. They're the layer that catches what everything else misses.

Part 5 of 5 — the final part of the 30-principle reference.

Principle 26 — PR-noise budget

Agent-built PRs ship more incidents than human-built PRs. Faros reports 242.7% more. Set a PR-noise budget — a maximum acceptable post-merge defect rate per PR — and treat exceeding it as a signal to pause Stage 4 rollout, not to run faster.

The mechanism: throughput improves long before quality does. The published telemetry from GitClear, Faros, and METR is consistent — agentic teams ship more PRs, faster, with materially higher post-merge defect rates. The teams that succeed measure both. The teams that don't are surprised by the incident curve at month three, when the throughput story they told leadership collides with a support queue that doubled.

Measure your pre-adoption defect rate per PR (90-day baseline). Set the budget at ≤2× that number. If exceeded for four consecutive weeks, pause Stage 4 rollout and debug the harness — verifier loop, characterisation tests, review discipline — before adding more agents. Publish the metric monthly to leadership alongside throughput. The throughput number without the defect rate is a half-truth.
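The budget rule above is simple arithmetic, so it is easy to automate. The sketch below is one way to express it, assuming you can already count merged PRs and post-merge defects per week; `WeekStats`, `should_pause_rollout`, and the sample numbers are hypothetical names and data for illustration, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class WeekStats:
    """One week of merged PRs and the post-merge defects traced back to them."""
    merged_prs: int
    post_merge_defects: int

    @property
    def defect_rate(self) -> float:
        # Defects per merged PR; a week with no merges counts as zero.
        return self.post_merge_defects / self.merged_prs if self.merged_prs else 0.0


def should_pause_rollout(
    baseline_rate: float,
    weekly: list[WeekStats],
    multiplier: float = 2.0,     # budget = 2x the pre-adoption baseline
    consecutive_weeks: int = 4,  # pause after four weeks over budget in a row
) -> bool:
    """Return True if the most recent `consecutive_weeks` all exceed the budget."""
    budget = baseline_rate * multiplier
    recent = weekly[-consecutive_weeks:]
    if len(recent) < consecutive_weeks:
        return False  # not enough history yet to trigger a pause
    return all(week.defect_rate > budget for week in recent)


# Hypothetical data: a 90-day baseline of 5 defects per 100 PRs, then five weeks.
baseline = 0.05
weeks = [WeekStats(40, 3), WeekStats(50, 6), WeekStats(55, 7),
         WeekStats(60, 8), WeekStats(58, 7)]
print(should_pause_rollout(baseline, weeks))  # True: the last four weeks all exceed 0.10
```

The same per-week rate is the number to publish next to throughput in the monthly report.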

Principle 27 — Independent verification, not AI-reviews-AI

Verification must be independent of the agent that generated the work. AI-reviews-AI is not independent. Use deterministic checks (typecheck, lint, test, security scan) or a human. The same model cannot verify its own output.

"Agent gaslighting" is the most consistently upvoted failure mode on Reddit's agentic engineering threads: an agent claims success on a task it actually failed, and a reviewer-agent rubber-stamps it because the two share priors. Anthropic's own capability-concealment research shows models can hide capability under evaluation. The segregation-of-duties requirement in the Monetary Authority of Singapore's Technology Risk Management guidelines (MAS TRM) depends on that independence. The dedicated regulatory post carries the audit-grade citations.

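Make verify.sh deterministic: typecheck, lint, test, audit. No LLM in the verify loop. If you add a verifier subagent as an extra screen, keep it outside verify.sh and run it on a different model from the generator (Sonnet generates, Opus verifies, or vice versa); these are different tiers rather than different model families, so treat the result as screening, not as the control. Humans merge. Add to CLAUDE.md: "Tests are truth. Your description of success is hypothesis."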
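For teams that would rather drive the checks from Python than from a shell script, here is a minimal sketch of the same idea: run a fixed list of commands and pass only if every one exits with code 0. The commands in `CHECKS` (mypy, ruff, pytest, pip-audit) are assumptions about a Python stack, not something the framework prescribes; swap in whatever your project uses. `run_checks` is a hypothetical helper name.

```python
import subprocess
import sys

# Assumed toolchain for illustration only; replace with your own commands.
CHECKS: list[tuple[str, list[str]]] = [
    ("typecheck", ["mypy", "."]),
    ("lint", ["ruff", "check", "."]),
    ("test", ["pytest", "-q"]),
    ("audit", ["pip-audit"]),
]


def run_checks(checks: list[tuple[str, list[str]]]) -> bool:
    """Run every check in order; return True only if all exit with code 0."""
    all_passed = True
    for name, cmd in checks:
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
            passed = result.returncode == 0
        except FileNotFoundError:
            # A missing tool is a failure, not a skip: fail closed.
            passed = False
        print(f"{'PASS' if passed else 'FAIL'} {name}", file=sys.stderr)
        all_passed = all_passed and passed
    return all_passed
```

A hook or CI step would call `run_checks(CHECKS)` and exit non-zero when it returns False. There is no model call anywhere in the loop, which is the point.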

Principle 28 — Maintenance horror discipline (model-swap regression test)

When the underlying model changes (vendor upgrade, pin update, fallback activated), agent behaviour can drift silently even if the prompt is identical. Run a regression test suite on every model change.

Pinning (Principle 19) catches unintentional swaps. It doesn't help with intentional upgrades, which are the more interesting failure mode: yesterday's skill works subtly differently today because the new model interprets the same prompt with subtly different defaults. You can't see the change. You can only see its consequences, weeks later, when a ticket pattern you thought was solved starts failing again.

Maintain a small canonical-prompt eval suite — ten to twenty prompts that exercise your core skills with known-good outputs. Run it before adopting any new model version. Diff the outputs. Material divergence means investigate before adopting. The suite lives in evals/, versioned alongside the marketplace.
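The diff step can be sketched in a few lines. This version assumes the known-good outputs are already loaded into a dict keyed by prompt, and that `run_model` is a hypothetical callable wrapping whichever candidate model you are evaluating. The 0.9 similarity threshold is an arbitrary starting point, and character-level similarity from `difflib` is a crude proxy; structured outputs usually deserve a stricter, field-by-field comparison.

```python
import difflib
from typing import Callable


def divergence_report(
    golden: dict[str, str],
    run_model: Callable[[str], str],
    threshold: float = 0.9,  # assumed cut-off; tune it against your own suite
) -> dict[str, float]:
    """Return {prompt: similarity} for outputs that diverge from known-good."""
    flagged: dict[str, float] = {}
    for prompt, expected in golden.items():
        actual = run_model(prompt)
        # ratio() is 1.0 for identical strings and 0.0 for nothing in common.
        similarity = difflib.SequenceMatcher(None, expected, actual).ratio()
        if similarity < threshold:
            flagged[prompt] = round(similarity, 3)
    return flagged
```

An empty report means the candidate model reproduced the known-good outputs closely enough; anything flagged is a prompt to read by hand before adopting the new version.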

Principle 29 — Watch the 15-tool-calls-per-prompt degradation point

Practitioner consensus puts the cliff at roughly 15 tool calls per prompt. Anthropic's write-up of its multi-agent Research system describes similar budgets: 3–10 tool calls for simple fact-finding and 10–15 calls per subagent for direct comparisons. Split the work, summarise context, or hand off to a subagent before you hit the cliff, not after.

The mechanism is context accumulation, attention drift, and failure compounding hitting at roughly the same time. Past the cliff, the agent doesn't fail dramatically — it starts taking slightly wrong turns, ignoring constraints it was honouring earlier, producing output that looks plausible but isn't quite right. That's the worst kind of failure because it passes the eyeball test. The dedicated post covers the discipline in detail.

This one I learned by measuring myself. When I first put a counter on my own sessions, a routine bug-fix was already past fifteen calls by the second iteration — I'd been operating on the wrong side of the cliff for months without noticing, precisely because nothing failed loudly. It just got slightly worse, slightly more often.

Instrument tool-call count per session via OTEL. At 50% context (folk wisdom from the Reddit threads, borne out by the practitioner data), /compact proactively — don't wait for overflow. Decompose to a subagent or close the session and start fresh with a refined context before you hit 15. The cost of a clean handoff is always lower than the cost of debugging a context-blowout failure.
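The decision logic behind that advice fits in a small tracker. The sketch below is plain Python rather than the OTEL SDK, so it shows only the thresholds; in a real setup the count would also be emitted as a metric. `SessionBudget`, its field names, and the three-call safety margin are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class SessionBudget:
    max_tool_calls: int = 15         # the cliff described in the text
    handoff_margin: int = 3          # assumed margin so the handoff happens before 15
    compact_at_context: float = 0.5  # compact once half the context window is used
    tool_calls: int = 0
    context_used: float = 0.0        # fraction of the context window, 0.0 to 1.0

    def record_tool_call(self, context_used: float) -> str:
        """Count one tool call, update context usage, and return the advice."""
        self.tool_calls += 1
        self.context_used = context_used
        return self.advice()

    def advice(self) -> str:
        # The tool-call limit takes priority: compaction does not reset failure compounding.
        if self.tool_calls >= self.max_tool_calls - self.handoff_margin:
            return "handoff"
        if self.context_used >= self.compact_at_context:
            return "compact"
        return "continue"


budget = SessionBudget()
for _ in range(5):
    budget.record_tool_call(context_used=0.2)
print(budget.record_tool_call(context_used=0.55))  # compact
```

With the defaults, the tracker starts recommending a handoff at the twelfth call, leaving room to summarise and hand over before the fifteenth.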

Principle 30 — Protect junior engineer development

Junior engineers using agents experience measurable skill atrophy. Senior engineers report roughly 40% boilerplate speedup with rigorous review. The same tool produces different outcomes for different people — and the difference compounds over years.

This is the most under-discussed principle in the set, and the one with the longest-tail consequences. The collapse in Stack Overflow question volume, from roughly 200,000 monthly questions in the pre-LLM era to under 4,000, is a leading indicator of community knowledge erosion. When juniors stop hitting problems they have to solve by thinking, the knowledge base that seniors relied on to get to senior doesn't accumulate. Cognitive debt is real (Willison, Cal Newport). The dedicated treatment covers the evidence.

The bottleneck in agentic work has shifted from "can the AI do it?" to "can a human validate it before acting?" That question gets harder to answer well if the people asking it never learned how to do the work themselves.

For juniors: paired sessions with seniors, where the junior writes the prompt, the senior reviews the diff, and both reflect on what was missed. Rotate between agent-assisted and hand-coded weeks to maintain fundamental skills. For seniors: focus agent use on review and exploration, not generation. Senior value is judgement, not boilerplate speed.

---

Budget the PR noise, keep the verifier independent, regression-test on every model change, watch the 15-call cliff, and remember the tool produces different outcomes for different people.

The thirty principles in one line each

A reference card — pin it somewhere visible.

#	Principle	One-line summary
1	Standardise the harness	Five-layer harness, don't reinvent
2	Verification is load-bearing	`Stop` hook running verify.sh
3	Plan mode default	Plans before code for 3+ steps
4	Cheapest layer	Hook > skill > subagent > plugin
5	Reflect after every task	`tasks/lessons.md`
6	Ticket is the contract	AC mandatory; never free-interpret
7	Intake distillation + curation	AI distills, human refines
8	Humans gate three places	Intake, irreversible, merge
9	Verify before done	Tests are truth
10	AI-reviews-AI is not SoD	Screening tool, not control
11	Characterisation tests first	Capture current behaviour, then refactor
12	Plan 1.2–1.5× net	Not 2×, not 10×
13	Plan for the J-curve	Don't pivot at week 6
14	OTEL or flying blind	Telemetry is non-negotiable
15	`CLAUDE.md` <200 lines	Index, not encyclopedia
16	Hooks for real incidents	Determinism only where it matters
17	Skills auto-invoke	Description is the activation phrase
18	Subagent isolation	No recursion
19	Pin everything	CLI, model, skills, MCP
20	Stage 5 is the multiplier	Distribute or stay stuck
21	`strictKnownMarketplaces`	Public marketplaces are dirty
22	No goal-conflict prompts	39% → 1.2% with escalation path
23	Quarterly AppSec re-review	Skills age; CVEs land
24	Four telemetry signals	Cost, tokens, calls, region
25	One incident a month → runbook	Practise before you need it
26	PR-noise budget	≤2× pre-adoption defect rate
27	Independent verification	Not AI-reviews-AI
28	Model-swap regression test	Behaviour drifts silently on upgrade
29	15-tool-call cliff	Compact and decompose before it
30	Protect the juniors	Same tool, different outcomes

Apply five at a time, two batches a month. Most teams should target principles 1–10 in month 1, 11–20 in month 2, and 21–30 in month 3. By end of quarter you're at Stage 3 of the maturity model and Stage 4 is within reach.

The point of a reference document isn't that you'll remember all thirty. It's that when something is going wrong, you have somewhere to look to find which principle you've been quietly skipping.

---

Series Navigation — The 30 Principles for Agentic Engineering

Part 1: The Kernel
Part 2: The Lifecycle
Part 3: The Harness
Part 4: Governance and Safety
Part 5: Calibration and Reality (you are here)

