The 30 Principles for Agentic Engineering — Part 2: The Lifecycle
- Principle 6 — The ticket is the contract
- Principle 7 — Intake-as-AI-distillation with human curation
- Principle 8 — Humans gate at intake, irreversible actions, and merge
- Principle 9 — Verify before done; never mark complete without proof
- Principle 10 — AI-reviews-AI is not segregation of duties
- Principle 11 — Characterisation tests before brownfield changes
- Principle 12 — Plan capacity at 1.2–1.5× net, not 2×
- Principle 13 — Plan for the J-curve
- Principle 14 — Operate phase: OTEL or you're flying blind
The kernel principles (Part 1) set the spine. This part is about what runs on top of them — the nine principles that describe how work actually moves through a team, from unstructured idea to merged PR that does what the ticket asked.
Part 2 of 5.
The reason lifecycle principles exist separately from the kernel is that most agentic failures aren't configuration failures. They're process failures: the agent was capable, the harness was wired, but work entered the system without a clear contract, executed without the right gates, and exited without proof it was done.
Principle 6 — The ticket is the contract
Every case study with measurable success — Cognition Devin, Spotify Honk, Anthropic-Accenture — runs work through an explicit ticket contract. The teams that struggle send prompts and hope.
The ticket is the contract between human and agent. Implementation, testing, and PR description all reference it. Acceptance criteria are mandatory. The agent never free-interprets. When the AC is ambiguous, it asks before acting — not after shipping.
Update your ticket template: persona, problem, journey, acceptance criteria — all required, no exceptions. Add to CLAUDE.md: "You implement against the ticket's acceptance criteria. Do not free-interpret. If AC is ambiguous, ask before acting." Include the AC checklist in the PR template so the human reviewer can trace each criterion to a diff hunk.
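The required-field rule can be checked mechanically before a ticket ever reaches the agent. A minimal Python sketch, assuming a ticket arrives as a plain dict: the field names mirror the template described above, and `missing_ticket_fields` is a hypothetical helper, not part of any ticketing tool.

```python
REQUIRED_FIELDS = ("persona", "problem", "journey", "acceptance_criteria")


def missing_ticket_fields(ticket: dict) -> list[str]:
    """Return the required fields that are absent or empty in a ticket."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = ticket.get(field)
        # None, "", and [] count as missing; so does whitespace-only text.
        if not value or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


# Illustrative ticket with one field left blank.
ticket = {
    "persona": "Support agent",
    "problem": "Cannot see refund status without opening the billing tool",
    "journey": "",
    "acceptance_criteria": ["Refund status shows on the ticket sidebar"],
}
print(missing_ticket_fields(ticket))
```

A check like this could run in the intake step, so an incomplete ticket is sent back to its author instead of being handed to the agent.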
The cost of writing a proper ticket is offset within the same session by the time you don't spend reconciling what the agent shipped against what you actually wanted.
Principle 7 — Intake-as-AI-distillation with human curation
Unstructured input (transcripts, slides, notes) flows to AI distillation, which produces proposed tickets, which a human reviews and refines. Neither step is optional.
Teams that skip AI distillation never get past 1× productivity — intake is where most of the time leaks. Teams that skip human curation ship the wrong things, fast. The leverage is the combination: typically 3–5× faster intake than hand-writing tickets, with quality set by the human reviewer rather than the model.
I run exactly this split on my own writing. A multi-agent pipeline distils research into a draft; then a human — me, or an editor — curates it. The distillation is what makes the volume possible; the curation is the only reason any of it is worth reading. Drop either half and the output collapses into a blank page or confident slop. The principle isn't theoretical for me — it's the reason this post exists and the reason it isn't generic.
Build or install a /transcript-to-stories skill. Run it on your next meeting recording. Spend thirty minutes refining the proposed tickets. Time that against your usual intake process — once — and the principle becomes self-evident.
Principle 8 — Humans gate at intake, irreversible actions, and merge
Three gates only. Not more.
Both extremes fail. "Approve every step" is too slow to be useful. "No gates anywhere" produces the Replit-style database-deletion stories. The three-gate model is the published convergent answer — Magentic-UI's +71% effectiveness from HITL came from gating at these specific points, not from gating more often.
The lifecycle is a coroutine: humans pause the loop at intake (turn proposals into approved tickets), irreversible actions (DB drops, prod deploys, payments), and merge. Between those three points, the agent runs.
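The coroutine framing maps directly onto a Python generator. The sketch below is illustrative only: the step kinds, `run_lifecycle`, and `drive` are hypothetical names, and the human decisions are pre-recorded so the example can run without a real approval UI.

```python
from typing import Generator

# Assumption: these labels are illustrative. The text names the three gate
# points but does not define a taxonomy of step kinds.
GATED_KINDS = {"intake", "irreversible", "merge"}


def run_lifecycle(steps: list[tuple[str, str]]) -> Generator[str, bool, list[str]]:
    """Run (kind, description) steps, pausing for a human at gated kinds.

    Yields the description of each gated step and expects True (approve) or
    False (reject) to be sent back. Returns the steps that actually ran.
    """
    executed = []
    for kind, description in steps:
        if kind in GATED_KINDS:
            approved = yield description  # pause until a human answers
            if not approved:
                break  # a rejected gate stops the loop
        executed.append(description)
    return executed


def drive(steps: list[tuple[str, str]], decisions: list[bool]) -> list[str]:
    """Drive the lifecycle with recorded decisions; no answer means reject."""
    loop = run_lifecycle(steps)
    answers = iter(decisions)
    try:
        next(loop)  # advance to the first gate
        while True:
            loop.send(next(answers, False))  # answer it, advance to the next
    except StopIteration as stop:
        return stop.value


steps = [
    ("intake", "approve ticket"),
    ("work", "implement"),
    ("work", "run tests"),
    ("irreversible", "deploy to prod"),
    ("merge", "merge PR"),
]
print(drive(steps, [True, False]))
```

Everything between two gates runs without interruption, which is the point of the model: the human is consulted three times, not once per step.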
Add permissions.deny for irreversible actions in managed settings:
Bash(*production*), Bash(rm -rf:*), Bash(terraform destroy*), Edit(.env*)
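One way to keep these rules reviewable is to generate the settings fragment from a single list. This is a sketch, assuming the settings file takes a `permissions` object with a `deny` array of rule strings. The rules are copied verbatim from the line above, their matching semantics are not verified here, and `build_settings` is a hypothetical helper.

```python
import json

# Rule strings copied verbatim from the text; check their exact matching
# behaviour against the current tool documentation before relying on them.
DENY_RULES = [
    "Bash(*production*)",
    "Bash(rm -rf:*)",
    "Bash(terraform destroy*)",
    "Edit(.env*)",
]


def build_settings(deny_rules: list[str]) -> str:
    """Render a settings fragment as JSON, dropping duplicate rules."""
    unique_rules = list(dict.fromkeys(deny_rules))  # dedupe, keep order
    settings = {"permissions": {"deny": unique_rules}}
    return json.dumps(settings, indent=2)


print(build_settings(DENY_RULES))
```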
Confirm PR review requires a human approval — no auto-merge.
Principle 9 — Verify before done; never mark complete without proof
"Done" means tests pass, linter passes, typecheck passes, and acceptance criteria are demonstrably met. If any one fails, the task is not done.
The kernel gives you the technical mechanism (verify.sh from Principle 2). This principle is the cultural one. The most expensive incidents in agentic deployments come from agents declaring success on tasks they didn't actually complete — "done-by-vibes" — and reviewers nodding because the diff looks plausible.
Tests are truth. Self-reported success is hypothesis.
Add to CLAUDE.md: "Run verify.sh. If anything fails, keep working. Do not mark complete until everything passes." Include a "verification output" section in the PR template. The ticket isn't closed until the proof is attached.
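The definition of done above reduces to a conjunction: every check passes and every acceptance criterion is met. A minimal sketch of that rule, not the `verify.sh` from Part 1: the `verify` and `is_done` names are hypothetical, and the demo commands are stand-ins for a project's real test, lint, and typecheck invocations.

```python
import subprocess
import sys


def verify(checks: dict[str, list[str]]) -> dict[str, bool]:
    """Run each named check command and record whether it exited with 0."""
    results = {}
    for name, command in checks.items():
        completed = subprocess.run(command, capture_output=True, text=True)
        results[name] = completed.returncode == 0
    return results


def is_done(results: dict[str, bool], criteria_met: dict[str, bool]) -> bool:
    """Done only if checks and criteria both exist and all of them pass."""
    if not results or not criteria_met:
        return False  # no evidence is not the same as success
    return all(results.values()) and all(criteria_met.values())


# Stand-in commands: one exits 0 (pass) and one exits 1 (fail).
checks = {
    "tests": [sys.executable, "-c", "raise SystemExit(0)"],
    "lint": [sys.executable, "-c", "raise SystemExit(1)"],
}
results = verify(checks)
print(results, is_done(results, {"AC-1": True}))
```

Treating an empty result set as "not done" is a deliberate choice in this sketch: a task with no checks run has no proof attached.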
Principle 10 — AI-reviews-AI is not segregation of duties
Having Claude review Claude's PR is not independent review. For regulated industries — MAS, HKMA, EU AI Act — it fails segregation-of-duties controls.
Sibling agents share priors, share training, share blind spots. Auditors are not going to accept "an agent reviewed it" in 2026, and they should not be expected to. This principle has its own long-form treatment with the regulatory citations.
The operational rule is short: use /review and /ultrareview as screening tools, not gates. Humans merge. No exceptions for auth, payments, infra, or DB migrations. Document this position in your CLAUDE.md and AppSec policy — you want the reasoning written down before the auditor asks.
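The screening-versus-gate distinction can be expressed as a merge check that ignores agent approvals entirely. A hedged sketch: `Review` and `can_merge` are hypothetical names, and requiring the approver to differ from the author is an assumption drawn from the usual reading of segregation of duties, not a rule stated above.

```python
from dataclasses import dataclass


@dataclass
class Review:
    reviewer: str
    is_human: bool
    approved: bool


def can_merge(reviews: list[Review], author: str) -> bool:
    """Allow merge only with approval from a human other than the author."""
    return any(
        r.approved and r.is_human and r.reviewer != author
        for r in reviews
    )


agent_only = [Review("review-agent", is_human=False, approved=True)]
with_human = agent_only + [Review("priya", is_human=True, approved=True)]
print(can_merge(agent_only, author="sam"), can_merge(with_human, author="sam"))
```

Agent reviews still appear in the list, so their findings remain visible to the human reviewer. They just never satisfy the gate on their own.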
Principle 11 — Characterisation tests before brownfield changes
Before any agent touches legacy code, generate characterisation tests that capture current behaviour. Don't modify code that isn't covered.
The brownfield over-refactor anti-pattern — agent rewrites stable code into something subtly broken — is one of the most consistently reported failure modes in the field. Feathers' 20-year-old technique is the cheap fix, and the agent itself is the perfect characterisation-test writer. The full tutorial walks through the prompt and the mutation-testing layer.
Pick one fragile module. Use the agent to generate characterisation tests for current behaviour. Add permissions.deny for that module's directory until the tests are in place. Only then does the agent get access to modify it.
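For readers who have not written one, a characterisation test pins down what the code does today, quirks included, rather than what a specification says it should do. A minimal sketch with a hypothetical legacy function. The expected values were obtained by running the function, which is the defining move of the technique.

```python
def legacy_shipping_fee(weight_kg: float, country: str) -> int:
    """Stand-in for legacy code (hypothetical), quirks and all."""
    if country == "SG":
        return 0 if weight_kg < 1 else 5
    fee = 10 + 2 * int(weight_kg)  # truncates fractional kilograms
    return min(fee, 50)


# Inputs chosen to hit each branch and edge. Expected values were recorded
# from the current code, not derived from a specification.
RECORDED_BEHAVIOUR = {
    (0.5, "SG"): 0,
    (1, "SG"): 5,
    (2.9, "US"): 14,  # quirk: 2.9 kg is billed as 2 kg
    (30, "US"): 50,   # the cap applies
}


def test_characterise_legacy_shipping_fee() -> None:
    for (weight, country), expected in RECORDED_BEHAVIOUR.items():
        assert legacy_shipping_fee(weight, country) == expected


test_characterise_legacy_shipping_fee()
```

The truncation quirk is recorded rather than fixed. If an agent later "cleans up" the rounding, this test fails and the behaviour change becomes a deliberate decision instead of an accident.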
Principle 12 — Plan capacity at 1.2–1.5× net, not 2×
Gross throughput improves roughly 2×. Net of downstream incident cost, real delivery improvement is 1.2–1.5×. Vendor 5–10× claims survive only by ignoring what shipped.
The evidence converges across independent sources: METR's −19% productivity finding; GitClear/Faros reporting 54% more bugs and 242.7% more incidents per PR; Veracode/Sherlock at 45–92% vulnerability rates in agent-generated code. The published case for 1.2× has its own dedicated post.
Plan team capacity at 1.2–1.5× net for the next quarter. Measure both throughput (PRs per week) and incident rate (post-merge defects per PR). If incident rate exceeds 2× pre-adoption baseline, pause the rollout and debug the harness before adding more agents.
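The planning rule and the pause rule are both simple arithmetic. The sketch below encodes them under stated assumptions: the function names are hypothetical, capacity is expressed in whatever unit the team already plans in, and the input numbers are made up for illustration.

```python
def plan_capacity(baseline: float, net_multiplier: float = 1.2) -> float:
    """Planned capacity at a net multiplier inside the 1.2 to 1.5 range."""
    if not 1.2 <= net_multiplier <= 1.5:
        raise ValueError("net multiplier should stay between 1.2 and 1.5")
    return baseline * net_multiplier


def should_pause_rollout(defects: int, prs: int, baseline_rate: float) -> bool:
    """Pause when post-merge defects per PR exceed 2x the baseline rate."""
    if prs == 0:
        return False  # nothing merged, nothing to judge
    return defects / prs > 2 * baseline_rate


# Illustrative inputs: 100 units of baseline capacity, 40 merged PRs,
# and a pre-adoption baseline of 0.1 defects per PR.
print(plan_capacity(100))
print(should_pause_rollout(defects=9, prs=40, baseline_rate=0.1))
```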
The metric to hide from leadership is the gross throughput number. It's flattering and it will lead to capacity decisions that create incidents.
Principle 13 — Plan for the J-curve
Productivity drops in weeks 4–8 of adoption before turning up. METR documented a 19% slowdown in that window. The week-6 abandonment pattern is one of the most expensive mistakes in agentic rollouts: leadership concludes the tool isn't working four weeks before it would have turned positive.
Section, Spotify, and Box all reported the same dip in their initial rollouts. The dedicated treatment covers the evidence.
Brief leadership on the J-curve before adoption begins — not during the dip, when you'll sound defensive. Set the success-measurement window at month 3, not week 6. During the dip, track adoption percentage and qualitative signals, not throughput. The dip is the harness being learned, not the tool not working.
Principle 14 — Operate phase: OTEL or you're flying blind
Pipe Claude Code's OpenTelemetry traces to a shared collector. Track cost per ticket, verifier catch rate, PR cycle time. Without telemetry, you cannot tell adoption from runaway — and at some point, they look identical in standups.
The teams that publish defensible ROI numbers — Section, Box, Spotify — all wired OTEL early. The teams that argue about whether the tool is working have anecdotes instead of data.
Set CLAUDE_CODE_ENABLE_TELEMETRY=1 and the OTEL endpoint in managed settings. Stand up the AWS OpenTelemetry Collector or LangSmith/Langfuse. Build one dashboard with three numbers: cost per developer per day, cache-hit rate, percentage of PRs agent-built. That dashboard, refreshed quarterly, is worth more than any retrospective survey.
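Once the telemetry is exported, the three dashboard numbers are straightforward aggregates. A sketch under loud assumptions: the `DailyUsage` record and its field names are hypothetical and do not reflect the actual telemetry schema, and cache-hit rate is defined here as cached input tokens over all input tokens, which is one reasonable definition among several.

```python
from dataclasses import dataclass


@dataclass
class DailyUsage:
    developer: str
    day: str                # ISO date, e.g. "2026-05-01"
    cost_usd: float
    cache_read_tokens: int  # input tokens served from cache
    input_tokens: int       # input tokens not served from cache


def dashboard(usage: list[DailyUsage], prs_total: int, prs_agent_built: int) -> dict[str, float]:
    """Compute the three dashboard numbers from exported usage records."""
    developer_days = {(u.developer, u.day) for u in usage}
    total_cost = sum(u.cost_usd for u in usage)
    cached = sum(u.cache_read_tokens for u in usage)
    uncached = sum(u.input_tokens for u in usage)
    all_input = cached + uncached
    return {
        "cost_per_developer_day": total_cost / len(developer_days) if developer_days else 0.0,
        "cache_hit_rate": cached / all_input if all_input else 0.0,
        "pct_prs_agent_built": 100 * prs_agent_built / prs_total if prs_total else 0.0,
    }


usage = [
    DailyUsage("a", "2026-05-01", 4.0, 800, 200),
    DailyUsage("b", "2026-05-01", 6.0, 200, 800),
]
print(dashboard(usage, prs_total=20, prs_agent_built=5))
```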
Tickets are contracts, AI distils intake under human curation, humans gate at three points, verification is the proof of done, AI review screens but never gates, characterisation tests defend the past, plan for 1.2× net, expect the J-curve, watch the telemetry.
Part 3 covers principles 15–20: the actual harness configuration that makes all of this cheap.
Series Navigation — The 30 Principles for Agentic Engineering
- Part 1: The Kernel
- Part 2: The Lifecycle (you are here)
- Part 3: The Harness
- Part 4: Governance and Safety
- Part 5: Calibration and Reality