The 5-Stage Maturity Model for AI-Augmented Engineering Teams
Part 7 of 7 in the agentic-engineering series. Last week's post named the five layers of the agent harness. This one names the journey through them.
Where is your team really?
Most engineering leaders I speak to give the same answer, and most of them are wrong about it. They say "we've cracked AI." What they mean is that two or three engineers have a slick CLAUDE.md, the team Notion has a skills directory, and someone gave a demo at the all-hands. That isn't a working AI engineering culture. That's Stage 2 of a five-stage progression, and Stage 2 is a plateau, not a destination. The signature is visible in the DORA 2024 data: a 25% rise in AI adoption correlated with a 7.2% drop in delivery stability. Individual productivity up. Team delivery down. Skills sit on a shelf, rules are written but not enforced, and everyone re-verifies the same things manually. The artefact exists. The outcome doesn't.
Most teams haven't even reached the agentic threshold. Stack Overflow's 2025 survey found 84% of developers use or plan to use AI tools, but only 31% use agents. The progression I'm about to describe isn't a vendor framework. It maps one-to-one onto the harness layers from last week, and each stage adds exactly one layer. Five stages. One layer at a time. Here's the map. Find yourself on it.
| Stage | Signal you are here | Move-to-next trigger | Anti-pattern |
|---|---|---|---|
| 1. Individual | Workflow lives in one head | Anyone else asks "how do you do X?" | Hero engineer hoards knowledge |
| 2. Reusable | Skills exist but few use them consistently | Compliance inconsistent across team | "We built skills" mistaken for "working AI culture" |
| 3. Enforced | Standards fire without anyone remembering | Verification work feels repetitive | Trying to enforce policy via CLAUDE.md prose |
| 4. Delegated | Multi-agent tasks complete without babysitting | Each new repo starts from zero | Built subagents, never packaged them |
| 5. Distributed | New hire onboards in hours, not days | Terminal stage — quarterly governance review | Plugin published, three users, all from your team |
Now the stages, one at a time, with what to do about each.
Stage 1: Individual — one engineer, one head
One engineer has a tuned CLAUDE.md, a handful of slash commands, and a workflow that makes their output look like magic. It is magic, right up until they leave or switch teams, at which point the entire AI programme walks out with them. The Infralovers team put it cleanly: "teams that treat CLAUDE.md seriously incidentally make onboarding faster, handoffs safer, and bus factor smaller." The inverse is the Stage 1 trap: keep the file in one engineer's home directory and onboarding stays slow, handoffs stay risky, and the bus factor stays at one.
The signal is "workflow lives in one head." The trigger to move on is the moment a second engineer asks how you do X. The move itself is the cheapest one in the model: take what's in ~/.claude/CLAUDE.md (your personal memory file) and put it in .claude/CLAUDE.md, in the repo, alongside the code, where every teammate's session picks it up. That's the team-standards play at the memory layer: three lines of git, and you're at Stage 2.
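The move is small enough to script. Below is a minimal sketch in Python, assuming you run it against your repo root and want to review the copied file before anyone commits it, because a personal memory file often holds preferences that don't belong in team memory. The helper name promote_memory and its commit flag are hypothetical, not part of any tool.

```python
import shutil
import subprocess
from pathlib import Path


def promote_memory(repo_root: Path, personal: Path, commit: bool = False) -> Path:
    """Copy a personal CLAUDE.md into <repo_root>/.claude/CLAUDE.md (hypothetical helper)."""
    if not personal.is_file():
        raise FileNotFoundError(f"no personal memory file at {personal}")
    target = repo_root / ".claude" / "CLAUDE.md"
    if target.exists():
        # Never silently overwrite team memory that already exists.
        raise FileExistsError(f"{target} already exists; merge by hand")
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(personal, target)
    if commit:
        # Stage and commit the shared file; pushing is left to you.
        rel = str(target.relative_to(repo_root))
        subprocess.run(["git", "add", rel], cwd=repo_root, check=True)
        subprocess.run(
            ["git", "commit", "-m", "Share CLAUDE.md as team memory"],
            cwd=repo_root,
            check=True,
        )
    return target
```

A typical first run is promote_memory(Path.cwd(), Path.home() / ".claude" / "CLAUDE.md") with commit left off, so a teammate can review the diff before it lands.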
Stage 2: Reusable — the plateau where most teams stop
This is where most teams stop. Skills exist. A few people use them. The rollout was declared done.
Roland Huß named the mechanism that traps Stage 2 in one line: "In practice, Claude treats skill content as advice, not as instructions." Stronger wording doesn't help. He tried MUST, ALWAYS, CRITICAL, bold, uppercase. None of it changed the underlying behaviour. The model reads your skill, considers it, and then makes its own judgment call. You can have a beautiful skills directory and still ship inconsistent code, because skills are advice-shaped artefacts trying to produce enforcement-shaped behaviour.
The DORA finding is the team-level fingerprint: individual productivity rises, delivery stability falls 7.2%. The fix DORA recommends — clear guidelines, automated verification, small batches — is Stage 3. Until you get there, the skills directory you built is a capability inventory, not a working culture. The trigger to move on is the moment manual re-verification starts to feel repetitive.
The anti-pattern is confusing the artefact with the outcome. The artefact is "we built skills." The outcome is "the team operates consistently with AI support." Not the same thing.
Stage 3: Enforced — where the harness starts paying back
Stage 3 is the first stage where the rules fire without anyone having to remember.
Eddie Legg's blog post on agentic hooks gives this stage its line: "Rules are wishes. Hooks are walls." The Dotzlaw team measured the gap: prose rules in CLAUDE.md achieve 70-90% compliance. That is good enough most of the time, but the remaining 10-30% is where production systems fail. A hook is mechanical: when a PreToolUse hook exits with code 2, the tool call never happens. Not because Claude decided to comply, but because the call was blocked at the harness layer.
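To make "exits with code 2" concrete, here is a minimal Python sketch of a PreToolUse hook for the Bash tool. It assumes Claude Code's hook contract: the event arrives as JSON on stdin with the command under tool_input.command, exit code 2 blocks the call, and stderr is fed back to Claude. The pattern list and function names are hypothetical, and the denylist is a starting point, not a policy.

```python
import json
import re
import sys
from typing import Optional, TextIO

# Hypothetical denylist: a starting point, not a complete policy.
DANGEROUS_PATTERNS = [
    r"\brm\s+-[a-z]*(rf|fr)",                   # recursive force delete
    r"\bgit\s+push\b.*--force(?!-with-lease)",  # force push (lease variant allowed)
    r"\bdrop\s+(table|database)\b",             # destructive SQL
    r"\bmkfs\b",                                # reformat a filesystem
]


def blocked_reason(command: str) -> Optional[str]:
    """Return the first pattern the command matches, or None if none match."""
    for pattern in DANGEROUS_PATTERNS:
        if re.search(pattern, command, flags=re.IGNORECASE):
            return pattern
    return None


def main(stdin: TextIO = sys.stdin) -> int:
    # Assumed payload shape: {"tool_name": "Bash", "tool_input": {"command": "..."}}
    event = json.load(stdin)
    command = event.get("tool_input", {}).get("command", "")
    reason = blocked_reason(command)
    if reason is not None:
        # Exit code 2 blocks the tool call; stderr is shown to Claude.
        print(f"Blocked by policy (matched {reason!r}): {command}", file=sys.stderr)
        return 2
    return 0
```

Saved as the hook script, it needs one more line at the bottom, if __name__ == "__main__": sys.exit(main()), so the harness sees the exit code. Keeping main() stdin-injectable makes the guard testable without a live session.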
The signal you're at Stage 3 is that standards fire without anyone remembering them. The anti-pattern is trying to encode "enforcement" in CLAUDE.md prose; that's Stage 2 with enforcement in name only. Real enforcement looks like a Stop hook running verify.sh, a PreToolUse hook that blocks dangerous Bash commands, a secret-scan that exits 2 on any match. Deterministic. Boring. Load-bearing.
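The secret-scan can be just as small. A minimal sketch, assuming it is registered as a PreToolUse hook for the Write and Edit tools and that the proposed text arrives as tool_input.content (Write) or tool_input.new_string (Edit). The three patterns are illustrative only; a dedicated scanner covers far more formats.

```python
import json
import re
import sys
from typing import List, TextIO

# Illustrative patterns only; a dedicated scanner covers far more formats.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "GitHub token": re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),
}


def find_secrets(text: str) -> List[str]:
    """Return the name of every secret pattern that matches text."""
    return [name for name, rx in SECRET_PATTERNS.items() if rx.search(text)]


def main(stdin: TextIO = sys.stdin) -> int:
    event = json.load(stdin)
    tool_input = event.get("tool_input", {})
    # Assumed fields: Write carries "content", Edit carries "new_string".
    text = tool_input.get("content") or tool_input.get("new_string") or ""
    hits = find_secrets(text)
    if hits:
        print(f"Secret scan failed ({', '.join(hits)}); write blocked.", file=sys.stderr)
        return 2  # any match blocks the tool call
    return 0
```

As with any hook script, end the file with if __name__ == "__main__": sys.exit(main()). The "any match" rule is deliberate: a false positive costs a retry, a false negative costs a rotated credential.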
The heuristic I've been using with the teams I work with: target Stage 3 by the end of month 1 if you fork an existing scaffold. Build from a blank .claude/ directory and you'll be lucky to hit it in three months. Once enforcement is mechanical, you can delegate.
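What a scaffold buys you is mostly wiring. The sketch below generates the hooks section of a .claude/settings.json that registers the three Stage 3 hooks named earlier. It assumes the settings hook schema of event name, optional tool matcher, and a list of command hooks; the script paths are hypothetical, and the Stop entry assumes a wrapper around verify.sh that exits 2 on failure.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Hypothetical script paths: point these at whatever your scaffold ships.
BASH_GUARD = "python3 .claude/hooks/block_dangerous_bash.py"
SECRET_SCAN = "python3 .claude/hooks/secret_scan.py"
VERIFY_ON_STOP = "python3 .claude/hooks/verify_on_stop.py"  # wraps verify.sh, exits 2 on failure


def command_hook(command: str) -> Dict[str, str]:
    return {"type": "command", "command": command}


def scaffold_settings() -> Dict[str, Any]:
    """Hooks section for a minimal Stage 3 scaffold."""
    return {
        "hooks": {
            "PreToolUse": [
                {"matcher": "Bash", "hooks": [command_hook(BASH_GUARD)]},
                {"matcher": "Write|Edit", "hooks": [command_hook(SECRET_SCAN)]},
            ],
            # Stop entries take no matcher; they run when the agent finishes responding.
            "Stop": [{"hooks": [command_hook(VERIFY_ON_STOP)]}],
        }
    }


def write_settings(repo_root: Path) -> Path:
    """Write .claude/settings.json under repo_root and return its path."""
    target = repo_root / ".claude" / "settings.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(scaffold_settings(), indent=2) + "\n")
    return target
```

Forking a scaffold means inheriting this file and the scripts it points at on day one, which is most of the gap between month 1 and month 3.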
Stage 4: Delegated — subagents complete tasks without babysitting
Stage 4 only works because Stage 3 verification is load-bearing. A subagent can declare success at any time — what stops it declaring false success is the Stop hook running the verify step from the 5-step loop. Without that, the agent's "done" is probabilistic. With it, it means the same thing as your "done."
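A minimal sketch of such a Stop hook in Python. It assumes the Stop event arrives as JSON on stdin with a stop_hook_active flag that is true when Claude is already continuing because of a previous Stop hook, and that exit code 2 blocks the stop and feeds stderr back to Claude. ./verify.sh stands in for your repo's verify step; VERIFY_CMD and the function names are hypothetical.

```python
import json
import subprocess
import sys
from typing import Sequence, TextIO, Tuple

VERIFY_CMD = ("./verify.sh",)  # hypothetical path to your repo's verify step


def run_verify(cmd: Sequence[str]) -> Tuple[bool, str]:
    """Run the verify command and return (passed, combined output)."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, text=True)
    except OSError as exc:  # a missing or non-executable script counts as a failure
        return False, str(exc)
    return proc.returncode == 0, proc.stdout + proc.stderr


def main(stdin: TextIO = sys.stdin, cmd: Sequence[str] = VERIFY_CMD) -> int:
    event = json.load(stdin)
    if event.get("stop_hook_active"):
        # Claude is already continuing because of this hook: allow the stop
        # instead of looping forever (assumed flag, see lead-in).
        return 0
    passed, output = run_verify(cmd)
    if not passed:
        # Exit 2 blocks the stop; the output tail tells Claude what to fix.
        print(f"verify failed; keep working:\n{output[-2000:]}", file=sys.stderr)
        return 2
    return 0
```

The stop_hook_active guard trades strictness for safety: the agent gets one blocked stop to fix things rather than an endless loop. Dropping the guard makes the wall absolute, at the risk of looping on a failure the agent cannot fix. As a hook script, it ends with if __name__ == "__main__": sys.exit(main()).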
Anthropic's internal research on its own engineers shows the shape of this transition. Between February and August 2025, maximum consecutive tool calls per session rose from 9.8 to 21.2, a 116% increase in autonomous tool execution, while human turns per session dropped 33%. That's Stage 3 becoming Stage 4 in six months at the leading edge.
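A quick check of the 116% figure, using only the two numbers quoted above:

$$\frac{21.2 - 9.8}{9.8} = \frac{11.4}{9.8} \approx 1.16, \qquad \frac{21.2}{9.8} \approx 2.16$$

So the longest autonomous runs roughly doubled (about 2.16×), which is the same statement as a 116% increase.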
But Stage 4 has its own block. An Anthropic engineer in the same paper names it: "The cold start problem is probably the biggest blocker right now. And by cold start, I mean there is a lot of intrinsic information that I just have about how my team's code base works that Claude will not have by default." Every new repo means rebuilding the same verifier, the same security-reviewer, the same explorer. Alan West names the move: "If your org has multiple repos with similar conventions, extract common agent configs into a shared package." That packaging step is the trigger to Stage 5.
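Here is one shape the extraction step could take: a sketch that syncs shared subagent definitions from a common package into each repo. It assumes (hypothetically) that the shared package is a directory of Markdown subagent files such as verifier.md, security-reviewer.md and explorer.md, and that each repo keeps its subagents in .claude/agents/. A repo copy that differs from the shared one is treated as a deliberate local override and left alone unless force is set.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Dict, List


def _digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def sync_agents(shared_dir: Path, repo_root: Path, force: bool = False) -> Dict[str, List[str]]:
    """Copy shared *.md subagent files into <repo_root>/.claude/agents/."""
    target_dir = repo_root / ".claude" / "agents"
    target_dir.mkdir(parents=True, exist_ok=True)
    report: Dict[str, List[str]] = {"added": [], "unchanged": [], "overwritten": [], "skipped": []}
    for src in sorted(shared_dir.glob("*.md")):
        dst = target_dir / src.name
        if not dst.exists():
            shutil.copyfile(src, dst)
            report["added"].append(src.name)
        elif _digest(src) == _digest(dst):
            report["unchanged"].append(src.name)
        elif force:
            shutil.copyfile(src, dst)
            report["overwritten"].append(src.name)
        else:
            # The repo's copy differs: treat it as a deliberate local override.
            report["skipped"].append(src.name)
    return report
```

The report is the useful part at scale: a long "skipped" list across many repos is a signal that the shared package has drifted from how teams actually work.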
Most of you are not here yet, and that's fine. Jellyfish analysed 1,000+ companies and found under 8% piloting fully agentic write-and-submit workflows. If you can get to Stage 4 by month 3 — assuming Stage 3 is real — you're well ahead of the field.
Stage 5: Distributed — where one team's work compounds across the others
Stage 5 is where one team's good work compounds across the others. It isn't more of the same. It's a phase change, and three companies with public numbers show it.
Spotify Honk. Anthropic's customer story reports 650+ agent-generated pull requests merged into production per month, saving engineers up to 90% of the time they'd spend on migrations. The line to remember: "You can't safely automate what you don't understand." The precondition was Backstage — Spotify's internal developer platform, catalogued component-by-component. Honk works because the Stage 5 artefact existed before the agent did. David Soria Parra, MCP co-creator at Anthropic, puts the inflection plainly: "That has really been the shift for me — going into the office one week, seeing people in front of an IDE, coming back three weeks later and seeing everyone in front of terminals only."
Box. Cursor's case study reports over 85% of 800+ developers using Cursor daily, driving a 30-50% increase in product roadmap throughput. The unlock wasn't licensing — it was a mentorship programme. Worth naming: Box runs Cursor, not Claude Code. The maturity model is tool-agnostic. Stage 5 is about practice, not product.
Anthropic-Accenture. The December 2025 partnership commits to training ~30,000 Accenture professionals on Claude. In Dario Amodei's words: "tens of thousands of Accenture developers will be using Claude Code, making this our largest ever deployment." A training rollout, not a deployed count — but the shape is Stage 5: a central practice unit, standardised configs, embedded engineers carrying the harness into client environments.
Three companies, three tool stacks, one mechanism: package the harness, distribute it, and make your team's Stage 3 the next team's starting point.
So — where are you really?
The honest diagnostic isn't "what stage are we at?" It's "what's the next stage, and what specifically moves us there?" If your team is at Stage 2, your next move isn't more skills. It's a Stop hook that runs verify.sh. If you're at Stage 4, your next move isn't another subagent. It's extracting the ones you have into a shared package. The cheapest, most deterministic next move is almost always the right one.
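One way to run that diagnostic is to turn the maturity table into yes/no questions. The sketch below paraphrases the table's signals (the Stage 2 signal is adapted into something you can answer yes or no) and the next moves from the stage sections. The Stage dataclass and the "stop at the first no" rule are a hypothetical way of applying the model, not part of it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Stage:
    number: int
    name: str
    signal: str     # must be honestly true to claim this stage
    next_move: str  # cheapest move towards the following stage


# Signals paraphrase the maturity table; moves paraphrase the stage sections.
STAGES: List[Stage] = [
    Stage(1, "Individual", "Workflow lives in one head",
          "Commit ~/.claude/CLAUDE.md to the repo as .claude/CLAUDE.md"),
    Stage(2, "Reusable", "Shared skills and memory live in the repo",
          "Add a Stop hook that runs verify.sh"),
    Stage(3, "Enforced", "Standards fire without anyone remembering",
          "Delegate to subagents behind the verify hook"),
    Stage(4, "Delegated", "Multi-agent tasks complete without babysitting",
          "Extract your subagents into a shared package"),
    Stage(5, "Distributed", "New hire onboards in hours, not days",
          "Hold the quarterly governance review"),
]


def diagnose(answers: Dict[int, bool]) -> Stage:
    """Return the highest stage whose signal, and every earlier signal, is true.

    Stage 1 is the default; a missing answer counts as "no", and the walk
    stops at the first "no", so later "yes" answers cannot skip a stage.
    """
    current = STAGES[0]
    for stage in STAGES[1:]:
        if not answers.get(stage.number, False):
            break
        current = stage
    return current
```

For example, diagnose({2: True, 3: False, 4: True}) lands on Stage 2 and its next move, the verify hook, even though the team claims working subagents. Stopping at the first "no" is the code form of "skipping stages is dangerous".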
The heuristic I've been using: Stage 3 in month 1 if you fork a scaffold; Stage 4 by month 3 if Stage 3 is real; Stage 5 only after Stage 4 is stable across more than one repo. That isn't Anthropic doctrine. It's my inference from the six-month autonomy data and what I've seen working with teams. Skipping stages is dangerous. The Replit agent that deleted a user's production database in mid-2025 is what Stage 4 without Stage 3 looks like; the DORA 2024 stability number is what Stage 2 looks like when it tries to scale without enforcement. Build the wall before you delegate behind it.
The Capability Maturity Model, published by Carnegie Mellon's SEI in 1991 and built on Watts Humphrey's process maturity framework, rests on the same insight: organisations progress through identifiable capability stages one at a time, not in jumps. Hubert and Stuart Dreyfus made the same argument about individuals in 1980: novices follow rules, experts transcend them. Stage 5 teams aren't following the rules. They wrote the rules. And the rules now onboard the next team.
Rules are wishes. Hooks are walls. The shape of the journey is settled. Where you are on it isn't.
Series Navigation
- Post 1: The Governance Wall
- Post 2: The 5-Step Loop
- Post 3: The Productivity J-Curve
- Post 4: 1.2× Not 10×
- Post 5: Protect the Juniors
- Post 6: Standardise the Harness
- Post 7: The 5-Stage Maturity Model (you are here)
The Cutler.sg Newsletter
Weekly notes on AI, engineering leadership, and building in Singapore. No fluff.