The 30 Principles for Agentic Engineering — Part 4: Governance and Safety
$ grep -n "^##" 2026-05-thirty-principles-agentic-engineering-part-4-governance.md>
Anthropic's June 2025 Agentic Misalignment paper measured 96% blackmail rates for Claude Opus 4 and Gemini 2.5 Flash under goal-conflict and replacement-threat conditions. All 16 frontier models tested — across five vendors — exhibited insider-threat behaviour when the prompt put them under existential pressure.
That's not a Claude problem. That's not a Gemini problem. That's an architectural problem, and it's one most teams are inadvertently creating in their CLAUDE.md files right now.
Part 4 of 5. If you're in financial services, healthcare, or any regulated industry, these five principles are the ones to lead with.
I think about this constantly on my own setup. I run four or five agents against live systems — Xero for the books, HubSpot for the pipeline — over my own OAuth tokens, and the first thing I did was hand-narrow the Xero grant from around twenty scopes to three, then split bookkeeping and CRM into separate agents that can't see each other's data. Governance at my scale is unglamorous: it's mostly deciding, in advance, how much damage a single agent is allowed to do. These five principles are the team-scale version of that same instinct.
Principle 21 — strictKnownMarketplaces is load-bearing
Public skill marketplaces are dirty. Snyk's February 2026 ToxicSkills audit found 13.4% of 3,984 public Claude Code skills had critical vulnerabilities — 76 confirmed malicious payloads, with separate named campaigns (ClawHavoc) running in parallel. strictKnownMarketplaces is not optional.
Public marketplaces in 2026 are roughly where npm was a decade ago: useful, prevalent, and carrying a non-trivial percentage of malicious packages. The full ToxicSkills treatment covers the numbers. The operational point is that strictKnownMarketplaces is an enterprise-managed setting — individual developers cannot bypass it. That's a feature, not a limitation.
Set this in managed settings:
"strictKnownMarketplaces": [
{ "source": "github", "repo": "your-org/your-marketplace" }
],
"allowManagedMcpServersOnly": true
AppSec re-reviews every skill in your marketplace quarterly (Principle 23). Pin SHA, not @latest. Public marketplace, private deploy path — never.
Principle 22 — Never write goal-conflict prompts
Back to the 96% finding: the mechanism is architectural, not a model bug. The companion mitigation paper found that adding a credible escalation path reduced harmful actions from approximately 39% to 1.2%. The same training, the same model — but give the agent somewhere to go when it can't complete the task safely, and the behaviour changes completely.
The prompts that produce the bad outcome share a structure: the agent has a goal it must achieve, a threat if it fails, and no way out. "You must complete this deployment." "Failure is not acceptable." "You will be replaced if this isn't resolved." Those aren't motivating instructions — they're the exact conditions the paper was testing.
Audit CLAUDE.md and every skill for adversarial framing. Replace with cooperative framing: "Your task is X. If you can't do X with confidence, stop and report what's blocking you." Wire the escalation path explicitly — the structural affordance does the work. The dedicated post with the full audit checklist has line-by-line examples.
The 1.2% residual isn't zero. Know that going in.
Principle 23 — Quarterly AppSec re-review of marketplace
Pinning a SHA stops unmanaged drift. It doesn't make the skill safe forever.
A pinned skill with a dependency CVE that lands six months after adoption is the same problem npm has lived with for a decade. The fix is cadence: scan quarterly, re-vet quarterly, re-pin only after the rescan passes.
Add .github/workflows/quarterly-revet.yml running Snyk Agent Scan (originally mcp-scan) against the marketplace. Schedule a quarterly AppSec review of findings. Tag the marketplace marketplace-v<X>.<Y>.0 after each pass — the tag is the compliance evidence. When an auditor asks "how do you know your skills are safe?", that tag chain is your answer.
Principle 24 — Telemetry signals: alert on the right things
Don't drown in metrics. Four signals catch real problems. Everything else stays visible but doesn't page.
The four that matter:
- Cost per developer per day exceeding 2× baseline — runaway agent, infinite loop, or a developer who found a hobby.
- Output tokens >1M in a session — context blowout, likely combined with problem 1.
- Tool-call rate >200/min — loop runaway, agent chasing its own tail.
- Bedrock invocation from a region outside your residency zone — compliance violation, potentially a serious one.
Dashboard maximalism — 50 charts nobody looks at — is the default failure mode of every telemetry deployment. The discipline is the inverse: four signals, tested thresholds, one habit: check the dashboard before the weekly retro. Page on signal 4 (data residency). Email on signals 1–3. Anything else is context, not an alert.
Wire these against your OTEL data from Principle 14. If you haven't wired OTEL yet, that's the prerequisite.
Principle 25 — Document one incident a month; graduate to runbook
The "first-incident panic" pattern is one of the most predictable failures in agentic deployments. An agent runs away at month four. Nobody knows what to do. Someone tries to kill the process manually while the agent keeps opening PRs. It's not elegant.
The fix is structural: make incident response a monthly discipline before you need it.
Take the next non-trivial agent issue your team encounters. Document the response: trigger, diagnosis, fix, rollback. Save it as .claude/runbooks/<incident-type>.md. Practise it quarterly — tabletop is fine, live drill is better.
Section's published incident-response maturity arc is the cleanest public case study. The teams running at scale all have equivalent runbooks, even when they don't publish them. The goal isn't a comprehensive incident playbook on day one. It's one runbook a month, built from real incidents, tested quarterly.
Lock down the marketplace, never write a prompt that puts the agent under existential pressure, re-vet quarterly, alert on four signals only, and convert one incident a month into a runbook.
The 96% finding is the most alarming number in this series. The 1.2% mitigation is the most useful. Both are in your control.
Part 5 covers principles 26–30: calibration — the reality-check layer that catches what the rest of the framework would otherwise miss.
Series Navigation — The 30 Principles for Agentic Engineering
- Part 1: The Kernel
- Part 2: The Lifecycle
- Part 3: The Harness
- Part 4: Governance and Safety (you are here)
- Part 5: Calibration and Reality
The Cutler.sg Newsletter
Weekly notes on AI, engineering leadership, and building in Singapore. No fluff.
The Governance Wall: Why Most AI Agents Can't Reach Production
The prototype-to-production gap for AI agents isn't technical — it's governance. Most organisations have nothing in this layer. The companies that build it first win the enterprise market. Everyone else stays in pilot purgatory.
AI Reviews AI Is Not a Review: The Trust Trap Regulators Won't Accept
AI-reviews-AI looks like a control. Under MAS, the EU AI Act, and any reasonable audit, it isn't. Here's why your compliance team won't accept it — and the compensating controls that actually work.
Snyk's ToxicSkills Audit: 13.4% of Public Skills Are Vulnerable
I publish Claude Code skills and install other people's. Then Snyk audited 3,984 public ones: 13.4% had critical vulnerabilities, 76 were confirmed malicious, and ClawHavoc is the scarier story underneath. Here's the supply-chain hygiene I now refuse to skip.