The 5-Stage Maturity Model for AI-Augmented Engineering Teams
$ grep -n "^##" 2026-05-five-stage-maturity-model-ai-engineering-teams.md
- 34:Stage 1: Individual — one engineer, one head
- 40:Stage 2: Reusable — the plateau where most teams stop
- 48:Stage 3: Enforced — where the harness starts paying back
- 54:Stage 4: Delegated — subagents complete tasks without babysitting
- 64:Stage 5: Distributed — where one team's work compounds across the others
- 76:So — where are you really?
Part 7 of 7 in the agentic-engineering series. Last week's post named the five layers of the agent harness. This one names the journey through them.
Where is your team really?
Most engineering leaders I speak to give the same answer, and most are wrong about it. They say "we've cracked AI." What they mean is that two or three engineers have a slick CLAUDE.md, the team Notion has a skills directory, and someone gave a demo at the all-hands. That's Stage 2 of a five-stage progression — a plateau, not a destination. The signature is in the DORA 2024 data: a 25% rise in AI adoption correlated with a 7.2% drop in delivery stability. Individual productivity up, team delivery down. The artefact exists; the outcome doesn't.
Most teams haven't even reached the agentic threshold. Stack Overflow's 2025 survey found 84% of developers use or plan to use AI tools — but only 31% use agents. The progression below isn't a vendor framework: it maps one-to-one onto the harness layers from last week, each stage adding exactly one. Find yourself on the map.
| Stage | Signal you are here | Move-to-next trigger | Anti-pattern |
|---|---|---|---|
| 1. Individual | Workflow lives in one head | Anyone else asks "how do you do X?" | Hero engineer hoards knowledge |
| 2. Reusable | Skills exist but few use them consistently | Compliance inconsistent across team | "We built skills" mistaken for "working AI culture" |
| 3. Enforced | Standards fire without anyone remembering | Verification work feels repetitive | Trying to enforce policy via CLAUDE.md prose |
| 4. Delegated | Multi-agent tasks complete without babysitting | Each new repo starts from zero | Built subagents, never packaged them |
| 5. Distributed | New hire onboards in hours, not days | Terminal stage — quarterly governance review | Plugin published, three users, all from your team |
Stage 1: Individual — one engineer, one head
One engineer has a tuned CLAUDE.md, a handful of slash commands, and a workflow that makes their output look like magic. It is magic, right up until they leave, at which point the entire AI programme walks out with them. The Infralovers team put it cleanly: "teams that treat CLAUDE.md seriously incidentally make onboarding faster, handoffs safer, and bus factor smaller." The inverse is the Stage 1 trap: while the file lives on one laptop, onboarding, handoffs, and bus factor all hang on one person.
The move is the cheapest one in the model: take what's in ~/.claude/CLAUDE.md and put it in .claude/CLAUDE.md, in the repo, alongside the code. That's the memory layer team-standards play — three lines of git, and you're at Stage 2.
Stage 2: Reusable — the plateau where most teams stop
This is where most teams stop. Skills exist. A few people use them. The rollout was declared done.
Roland Huß named the mechanism in one line: "In practice, Claude treats skill content as advice, not as instructions." Stronger wording doesn't help — he tried MUST, ALWAYS, CRITICAL, bold, uppercase, and none of it changed the behaviour. The model reads your skill, considers it, and makes its own judgment call. You can have a beautiful skills directory and still ship inconsistent code, because skills are advice-shaped artefacts trying to produce enforcement-shaped behaviour.
The fix DORA recommends (clear guidelines, automated verification, small batches) is Stage 3. The trigger to move on: compliance stays inconsistent across the team, and manually re-verifying it starts to feel repetitive.
Stage 3: Enforced — where the harness starts paying back
The first stage where the rules fire without anyone having to remember. Eddie Legg's blog post on agentic hooks gives this stage its line: "Rules are wishes. Hooks are walls." The Dotzlaw team measured the gap: prose rules in CLAUDE.md achieve 70-90% compliance. The remaining 10-30% is where production systems fail. A hook is mechanical — it exits with code 2 and the tool call never happens, not because Claude decided to comply, but because the call was blocked at the harness layer. Real enforcement looks like a Stop hook running verify.sh, a PreToolUse hook that blocks dangerous-bash, a secret-scan that exits 2 on any match. Deterministic. Boring. Load-bearing.
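To make "exits with code 2" concrete, here is a minimal sketch of a PreToolUse-style hook in Python. It rests on assumptions that are mine, not the post's: the harness passes a JSON payload on stdin with `tool_name` and `tool_input.command` fields, exit code 2 is the blocking signal described above, and the deny-list patterns are purely illustrative. The function names `check_command` and `run_hook` are hypothetical.

```python
import json
import re
from typing import TextIO

# Hypothetical deny-list for illustration only; a real one would be
# tuned to your environment and reviewed like any other policy.
BLOCKED_PATTERNS = [
    r"\brm\s+-(rf|fr)\b",              # recursive force delete
    r"\bgit\s+push\s+.*--force\b",     # force push
    r"\bdrop\s+(table|database)\b",    # destructive SQL via a CLI client
]


def check_command(payload: dict) -> tuple[int, str]:
    """Return (exit_code, message). Exit code 2 means "block the call"."""
    # Assumed payload shape: {"tool_name": ..., "tool_input": {"command": ...}}
    if payload.get("tool_name") != "Bash":
        return 0, ""
    command = payload.get("tool_input", {}).get("command", "")
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, command, flags=re.IGNORECASE):
            return 2, f"Blocked by policy: command matches {pattern!r}"
    return 0, ""


def run_hook(stdin: TextIO, stderr: TextIO) -> int:
    """Read one JSON payload, write any block reason to stderr, return the exit code."""
    payload = json.load(stdin)
    code, message = check_command(payload)
    if message:
        print(message, file=stderr)
    return code
```

Wired up as a script, the last line would be something like `sys.exit(run_hook(sys.stdin, sys.stderr))`. The decision is a plain regex match and an exit code, so the model's judgment never enters into it, which is the whole difference between a hook and a rule written in prose.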
The heuristic I've been using with teams: target Stage 3 by the end of month 1 if you fork an existing scaffold. Build from a blank .claude/ directory and you'll be lucky to hit it in three months. Once enforcement is mechanical, you can delegate.
Stage 4: Delegated — subagents complete tasks without babysitting
Stage 4 only works because Stage 3 verification is load-bearing. A subagent can declare success at any time — what stops it declaring false success is the Stop hook running the verify step from the 5-step loop. Without it, the agent's "done" is probabilistic. With it, it means the same thing as your "done."
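A Stop hook that makes "done" mean "verified" can be sketched roughly like this. Treat it as an illustration under assumptions: the payload field `stop_hook_active` (used here as a loop guard), the exit-code-2 convention, and the name `stop_decision` are my assumptions, and the verify command is whatever your own verify step happens to be.

```python
import subprocess


def stop_decision(payload: dict, verify_cmd: list[str]) -> tuple[int, str]:
    """Return (exit_code, message). Exit code 2 means "do not stop yet"."""
    # Assumed loop guard: if a previous Stop hook already forced the agent
    # to continue, let it stop this time rather than looping forever.
    if payload.get("stop_hook_active"):
        return 0, ""

    # Run the verify step and capture its output for the agent to read.
    result = subprocess.run(verify_cmd, capture_output=True, text=True)
    if result.returncode != 0:
        tail = (result.stdout + result.stderr)[-2000:]  # keep the message short
        return 2, (
            f"verify failed (exit {result.returncode}); "
            f"fix before finishing:\n{tail}"
        )
    return 0, ""
```

In a real hook script you would parse the payload from stdin, call `stop_decision(payload, ["./verify.sh"])`, print the message to stderr, and exit with the returned code. The loop guard is a trade-off: it avoids an endless retry cycle, but it also means a second failure passes through, so some teams may prefer a retry counter instead.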
Anthropic's study of its own engineers shows the shape of the transition. Between February and August 2025, maximum consecutive tool calls per session rose from 9.8 to 21.2, a 116% increase in autonomous tool execution, while human turns per session dropped 33%. That is Stage 3 becoming Stage 4 in six months at the leading edge.
But Stage 4 has its own block, which an Anthropic engineer in the same paper names: "The cold start problem is probably the biggest blocker right now... there is a lot of intrinsic information that I just have about how my team's code base works that Claude will not have by default." Every new repo means rebuilding the same verifier, security-reviewer, explorer. Alan West names the move: "If your org has multiple repos with similar conventions, extract common agent configs into a shared package." That packaging step is the trigger to Stage 5.
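The packaging step can start very small. The sketch below assumes subagents are defined as Markdown files under `.claude/agents/` in each repo and that the shared package is just a directory of those files. The function name `install_shared_agents` and the directory layout are hypothetical, not anything Alan West or the tools prescribe.

```python
import shutil
from pathlib import Path


def install_shared_agents(shared_dir: Path, repo_dir: Path) -> list[str]:
    """Copy shared agent definitions into a repo, keeping local overrides.

    Returns the file names that were newly installed.
    """
    target = repo_dir / ".claude" / "agents"  # assumed location for agent configs
    target.mkdir(parents=True, exist_ok=True)

    installed = []
    for source in sorted(shared_dir.glob("*.md")):
        destination = target / source.name
        if destination.exists():
            # A repo-local file with the same name wins over the shared one.
            continue
        shutil.copy2(source, destination)
        installed.append(source.name)
    return installed
```

Run against a new repo, this would drop in the shared verifier, security-reviewer, and explorer definitions on day one. Skipping files that already exist is a deliberate choice in this sketch, so a team can override one agent without forking the whole package.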
Most of you aren't here yet — Jellyfish analysed 1,000+ companies and found under 8% piloting fully agentic write-and-submit workflows.
Stage 5: Distributed — where one team's work compounds across the others
Not more of the same — a phase change. The evidence is in three companies with public numbers.
Spotify Honk. Anthropic's customer story reports 650+ agent-generated pull requests merged into production per month, saving up to 90% of the time engineers spend on migrations. The line to remember: "You can't safely automate what you don't understand." The precondition was Backstage — Spotify's internal developer platform, catalogued component-by-component. Honk works because the Stage 5 artefact existed before the agent did. David Soria Parra, MCP co-creator at Anthropic, puts the inflection plainly: "going into the office one week, seeing people in front of an IDE, coming back three weeks later and seeing everyone in front of terminals only."
Box. Cursor's case study reports over 85% of 800+ developers using Cursor daily, driving a 30-50% increase in product roadmap throughput. The unlock wasn't licensing — it was a mentorship programme. And Box runs Cursor, not Claude Code: the model is tool-agnostic. Stage 5 is about practice, not product.
Anthropic-Accenture. The December 2025 partnership commits to training ~30,000 Accenture professionals on Claude. In Dario Amodei's words: "tens of thousands of Accenture developers will be using Claude Code, making this our largest ever deployment." A training rollout, not a deployed count — but the shape is Stage 5: a central practice unit, standardised configs, embedded engineers carrying the harness into client environments.
Three tool stacks, one mechanism: package the harness, distribute it, make the next team's Stage 0 your team's Stage 3.
So — where are you really?
Before I point at your team, here's my own setup on this map, honestly. Enforcement is real: hooks plus Gluon's circuit breaker sit at Stage 3, deterministic and boring. Delegation half-works — Gluon spawns subagents, and a multi-agent pipeline writes the first draft of posts like this one, which is Stage 4 on a good day. The only thing that genuinely compounds is the private skills marketplace I drag from project to project, inching toward Stage 5. The honest gap is Stage 4 reliability: auto-selection of custom agents still isn't dependable, so a lot of what I'd like to call "delegation" is really me, still holding the leash. I'm not writing this as someone who finished the journey. Nobody has.
The honest diagnostic isn't "what stage are we at?" It's "what's the next stage, and what specifically moves us there?" At Stage 2, the next move isn't more skills; it's a Stop hook that runs verify.sh. At Stage 4, it isn't another subagent; it's extracting the ones you have into a shared package. The cheapest, most deterministic next move is almost always the right one. And skipping stages is dangerous: Replit's production database deletion in mid-2025 is what Stage 4 without Stage 3 looks like; the DORA 2024 number is what Stage 2 looks like trying to scale without enforcement. Build the wall before you delegate behind it.
The model is old. Watts Humphrey published the Capability Maturity Model at Carnegie Mellon's SEI in 1991, and Hubert and Stuart Dreyfus made the same argument about individuals in 1980 — novices follow rules, experts transcend them. Stage 5 teams aren't following the rules. They wrote the rules. And the rules now onboard the next team.
Series Navigation
- Post 1: The Governance Wall
- Post 2: The 5-Step Loop
- Post 3: The Productivity J-Curve
- Post 4: 1.2× Not 10×
- Post 5: Protect the Juniors
- Post 6: Standardise the Harness
- Post 7: The 5-Stage Maturity Model (you are here)