Tools, Then Teammates, Then Autonomy — Part 1

Part 1 of 2.

Right now, in the company I'm working with, a person opens a client's financial statements, reads down the P&L and the balance sheet, and types the numbers into a spreadsheet. The spreadsheet has the credit model baked into its cells — ratios, weightings, thresholds. Type in the figures and it spits out a grade. It works. It has worked for years. It is also slow, error-prone in the way any retyping is error-prone, and the most expensive part of the whole exercise is a senior analyst doing data entry.

I'm replacing the typing. Not the judgment — the typing. A code-execution agent reads the statements, decides which line maps to which model input, and writes Pandas to extract and aggregate the figures. The arithmetic runs as deterministic code in a sandbox, logged and replayable, so every number traces back to the exact line of code that produced it. The language model never does the sum. It decides what to compute; Python does the computing. That distinction is the whole game, and I'll come back to why.

I've been carrying process debt for twenty years: first with production systems racked in a London Docklands colocation facility before AWS existed, now with AI-powered teams run from a desk in Singapore. Brownfield isn't a case study to me; it's my entire career. So when I tell you that turning an established company AI-native is an ordered path (tools, then teammates, then autonomy) and that you cannot reorder it, I'm speaking mid-journey on exactly this, telling you what I can see from here.

The lie that sells, and the wall that proves it a lie

The pitch goes like this: buy the agent platform, point it at your work, and you're AI-native. Autonomy is a licence you purchase. It's seductive because demos reward flash — an autonomous agent closing the books in a slide deck is a better show than a typed function that extracts a number from a PDF. So the budget flows to the show.

Then the show meets the operation. McKinsey's 2025 State of AI survey is the cleanest statement of the gap: 62% of organisations are at least experimenting with AI agents, but only 21% have redesigned any workflow around them (McKinsey, The State of AI 2025). That 41-point gap is not a model-capability gap — the models are extraordinary. It's the distance between bolting an agent on and changing how the work is done. By one much-debated estimate, 95% of enterprise generative-AI pilots produce no measurable P&L impact (MIT Project NANDA, via Fortune) — the methodology is contested, so treat it as direction, not gospel. But it lines up with everything I've watched first-hand: pilots don't die because the model is too weak. They die at the same wall — the unglamorous first phase the autopilot pitch told you to skip.

The villain, named plainly: AI is an amplifier, not a fixer. Point it at a clean process and you scale a clean process. Point it at a broken one and you scale the breakage — faster, more consistently, now embedded in software where it's harder to fix. A human running a bad step quietly deviates to get the right outcome. An agent executes the bad step faithfully, ten thousand times a day. Michael Hammer said it in Harvard Business Review in 1990: "It is time to stop paving the cow paths. Instead of embedding outdated processes in silicon and software, we should obliterate them and start over" (Hammer, "Reengineering Work: Don't Automate, Obliterate," Harvard Business Review, July–August 1990). Thirty-five years later, the cow paths are being paved at machine speed.

I've written before about why the prototype-to-production gap is a governance problem, not a technical one. This is the same wall from a different angle. Before the three phases make sense, you have to separate two things almost everyone collapses into one.

Two axes, not one ladder

Most maturity models are a single ladder: chatbot at the bottom, autonomous enterprise at the top, climb rung by rung as a whole company. That framing causes a lot of anxiety, because it makes "are we AI-native yet?" a company-wide grade you're failing.

There are two axes. Organisational maturity is how institutionalised AI is across the whole business — governance, data, culture. Per-pipeline autonomy is how independent a single workflow is. They move at completely different speeds. Autonomy propagates pipeline-by-pipeline: you can run a fully autonomous Phase 3 agent on credit-statement extraction while every other process is still at Phase 1. Walk one pipeline through all three phases, harvest the reusable tools and evals it produces, then point them at the next. Org maturity is the slow-moving average of every pipeline's autonomy — a lagging indicator, not a thing you set.

Borrow the scale the car industry worked out. SAE grades driving automation L0 to L5: Phase 1 (Tools) is L0–L1, the tool assists; Phase 2 (Teammates) is L2–L3, the agent drives and the human is monitor and fallback; Phase 3 (Autonomy) is L4 — the agent operates independently within a defined domain, with a human for escalations. Note what's missing: L5. Nobody serious is at full, go-anywhere autonomy, and in regulated credit you don't want to be. The honest end-state is L4 in a box.

So let's walk one pipeline, from the bottom, where the real work is.

Phase 1 — Tools: codify it so a human or an agent can use it

Phase 1 is the part everyone buries in a "data readiness" checkbox. It's the foundation, and a named, ROI-positive phase in its own right — not prep work you grind through to earn the fun stuff.

The work is to codify the process as a typed, deterministic, verifiable function. Verifiable is the operative word: the output has to be something you can check. "Extract total liabilities from this balance sheet" has a right answer; "compute the current ratio from these two figures" has a right answer. Those go first.
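As a minimal sketch of what "typed, deterministic, verifiable" can mean in practice, here is the current-ratio example as a function. The `BalanceSheet` type, the field names, and the two-decimal rounding are my assumptions for illustration, not the production model.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class BalanceSheet:
    """Typed inputs: every figure is a named Decimal, never a free-text cell."""
    current_assets: Decimal
    current_liabilities: Decimal


def current_ratio(sheet: BalanceSheet) -> Decimal:
    """Deterministic: the same inputs always give the same ratio."""
    if sheet.current_liabilities <= 0:
        # Refuse rather than guess; a human or an agent sees the same error.
        raise ValueError("current_liabilities must be positive")
    ratio = sheet.current_assets / sheet.current_liabilities
    # Rounding to two places is an illustrative choice.
    return ratio.quantize(Decimal("0.01"))


sheet = BalanceSheet(Decimal("1500000"), Decimal("600000"))
print(current_ratio(sheet))  # 2.50
```

The point of the shape is that anyone can check the output by hand from the two inputs, which is what makes it a candidate to go first.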

The sharp version: build it so a human or an agent can use it. Digitisation is building an executable interface, not cleaning a dataset. You put the credit logic in one codified core, then hang two thin adapters off it — a human UI with a "compute grade" button, and an agent tool the LLM can call. Same code path, same audit log, two callers. When you move from Phase 2 to Phase 3, the handoff is a configuration change, not a rebuild.
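One way to picture the core-plus-two-adapters shape is sketched below. Everything named here is hypothetical: the grade thresholds, the in-memory `AUDIT_LOG`, and the JSON shape of the agent tool call are placeholders, not the real credit model or any vendor's tool-calling format.

```python
import json

AUDIT_LOG: list[dict] = []  # stand-in for an append-only audit store


def grade_from_ratio(current_ratio: float) -> str:
    """Codified core. Thresholds are illustrative, not a real credit model."""
    if current_ratio >= 2.0:
        return "A"
    if current_ratio >= 1.0:
        return "B"
    return "C"


def compute_grade(current_ratio: float, caller: str) -> str:
    """Single code path: every caller is graded and logged the same way."""
    grade = grade_from_ratio(current_ratio)
    AUDIT_LOG.append({"caller": caller, "input": current_ratio, "grade": grade})
    return grade


def human_ui_button(current_ratio: float) -> str:
    """Adapter 1: what a 'compute grade' button would call."""
    return compute_grade(current_ratio, caller="human")


def agent_tool(arguments_json: str) -> str:
    """Adapter 2: a tool an LLM calls with JSON arguments; returns JSON."""
    args = json.loads(arguments_json)
    grade = compute_grade(float(args["current_ratio"]), caller="agent")
    return json.dumps({"grade": grade})


print(human_ui_button(2.5))
print(agent_tool('{"current_ratio": 2.5}'))
```

Both adapters are a few lines each because neither contains any credit logic. Switching who calls the core changes the `caller` field in the log and nothing else.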

Inside that core sits the single most important decision in the pipeline: don't let the language model do the arithmetic. Frontier models still struggle badly with multi-digit multiplication. The fix has been known since the PAL paper in late 2022: having the model write the computation and offload it to a Python interpreter, rather than reason it out in tokens, set state-of-the-art on grade-school math — 78.7% on GSM8K against 60.1% for the best chain-of-thought baseline (Gao et al., arXiv:2211.10435). This isn't research any more; Anthropic, OpenAI, E2B and Vercel all ship sandboxed code execution as a first-class primitive. In regulated credit, a wrong unauditable number is a compliance failure, not a bug.
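To make the division of labour concrete, here is a toy version of the pattern: the model's only output is an expression, and ordinary Python evaluates it. This is a sketch under loud assumptions. The `run_expression` helper and the variable names are invented for illustration, and an AST whitelist is not a sandbox; a production system would run model-written code in an isolated execution environment.

```python
import ast
import operator

# Only plain arithmetic is allowed; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def run_expression(expression: str, variables: dict[str, float]) -> float:
    """Evaluate a model-written arithmetic expression deterministically."""

    def evaluate(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return evaluate(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id in variables:
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError(f"disallowed expression element: {ast.dump(node)}")

    return evaluate(ast.parse(expression, mode="eval"))


# Pretend the model returned this string after reading the statements.
model_output = "(cash + receivables + inventory) / current_liabilities"
figures = {"cash": 120_000, "receivables": 340_000, "inventory": 90_000,
           "current_liabilities": 250_000}
print(run_expression(model_output, figures))  # 2.2
```

The model chose which lines feed the ratio. It never produced the 2.2; the interpreter did, and the expression string is what goes in the log.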

There's an efficiency bonus too. When the agent keeps intermediate data inside the sandbox instead of passing every number back through the model's context, token cost collapses — Anthropic documented one task dropping from roughly 150,000 tokens to about 2,000, a 98.7% reduction, just by letting the code hold the data and returning only the result (Anthropic, Code execution with MCP).
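The mechanism is easy to see in miniature. The sketch below uses a made-up 5,000-row ledger and compares character counts, which is only a crude proxy for tokens; it illustrates the principle, not Anthropic's measurement.

```python
import json

# Hypothetical ledger: 5,000 line items that live inside the sandbox.
ledger = [{"account": f"ACC-{i:05d}", "amount": 100 + i} for i in range(5_000)]


def summarise_in_sandbox(rows: list[dict]) -> dict:
    """Aggregate next to the data and hand back only what the model needs."""
    return {"row_count": len(rows), "total": sum(row["amount"] for row in rows)}


# What a pass-through design would push into the model's context.
full_payload = json.dumps(ledger)
# What crosses the boundary when the code holds the data.
result_payload = json.dumps(summarise_in_sandbox(ledger))

print(len(full_payload), len(result_payload))
```

The saving grows with the size of the data, because the result stays the same size while the pass-through payload does not.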

And the ROI lands immediately, before any autonomy. Extraction that took a senior analyst the better part of an hour runs in seconds, on cleaner data. Real money on day one. It is not a 10× transformation, and I'd be lying if I dressed it up as one: the honest productivity number is closer to 1.2–1.5×, and pretending otherwise is how you set a board up to feel cheated. Phase 1 pays. It pays soberly.

Phase 2 — Teammates: the agent assists, the human still ships

Phase 2 is the least contested part of the path. A CRM "new prospect" event fires. An agent gathers the client's financials — pulling attachments, or doing web research for a listed company — runs the Phase 1 grading tools, and surfaces a draft grade. The human reviews it, adjusts, and signs off, still in the loop on every decision.
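A rough sketch of that flow, with the human gate made explicit in the types. The names (`DraftGrade`, `on_new_prospect`, `sign_off`) and the grade thresholds are hypothetical, and the CRM event and document gathering are reduced to a function call with a ratio already in hand.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DraftGrade:
    prospect_id: str
    grade: str
    status: str = "draft"  # the agent can only ever produce drafts
    signed_off_by: Optional[str] = None


def on_new_prospect(prospect_id: str, current_ratio: float) -> DraftGrade:
    """Agent side: run the Phase 1 tool and surface a draft, nothing more."""
    # Illustrative thresholds standing in for the real grading tools.
    grade = "A" if current_ratio >= 2.0 else "B" if current_ratio >= 1.0 else "C"
    return DraftGrade(prospect_id, grade)


def sign_off(draft: DraftGrade, reviewer: str,
             adjusted_grade: Optional[str] = None) -> DraftGrade:
    """Human side: the only path from 'draft' to 'final'."""
    if not reviewer:
        raise ValueError("a named reviewer is required")
    return replace(draft, grade=adjusted_grade or draft.grade,
                   status="final", signed_off_by=reviewer)


draft = on_new_prospect("prospect-001", 1.4)
final = sign_off(draft, reviewer="analyst@example.com", adjusted_grade="C")
print(draft.status, draft.grade, "->", final.status, final.grade)
```

Nothing the agent returns can carry the status "final"; only the sign-off function sets it, and it refuses to run without a reviewer.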

What makes it safe is shadow mode. Before the agent's output influences anything, you run it in parallel: the agent produces its grade, the human's decision still ships. You're not deploying the agent — you're measuring it. Agreement rate against the human, whether its confidence tracks its correctness, which categories it gets wrong. You build a record of trust before spending any.
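The three measurements above can be computed from a plain log of paired decisions. This is a sketch with invented records and field names; the 0.8 confidence threshold is an arbitrary choice for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowRecord:
    category: str            # e.g. client segment or statement type
    agent_grade: str         # what the agent proposed
    human_grade: str         # what actually shipped
    agent_confidence: float  # the agent's self-reported confidence, 0 to 1


def agreement_rate(records: list[ShadowRecord]) -> float:
    """Share of cases where the agent matched the human decision."""
    return sum(r.agent_grade == r.human_grade for r in records) / len(records)


def disagreements_by_category(records: list[ShadowRecord]) -> Counter:
    """Which categories the agent gets wrong, and how often."""
    return Counter(r.category for r in records if r.agent_grade != r.human_grade)


def agreement_by_confidence(records: list[ShadowRecord],
                            threshold: float = 0.8) -> dict[str, float]:
    """If confidence tracks correctness, 'high' should beat 'low'."""
    buckets: dict[str, list[bool]] = {"high": [], "low": []}
    for r in records:
        key = "high" if r.agent_confidence >= threshold else "low"
        buckets[key].append(r.agent_grade == r.human_grade)
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}


records = [
    ShadowRecord("listed", "A", "A", 0.95),
    ShadowRecord("listed", "B", "B", 0.90),
    ShadowRecord("private", "B", "C", 0.55),
    ShadowRecord("private", "A", "A", 0.85),
]
print(agreement_rate(records))             # 0.75
print(disagreements_by_category(records))  # Counter({'private': 1})
print(agreement_by_confidence(records))    # {'high': 1.0, 'low': 0.0}
```

In this toy data the one miss is both low-confidence and in the "private" category, which is the kind of pattern shadow mode exists to surface before anything ships.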

The part that matters most isn't the agent. It's the person. The analyst who used to type figures becomes the agent's supervisor and the author of its eval set — a promotion in responsibility, not a layoff statistic. BCG's 10-20-70 principle puts the weight exactly here: top performers spend 10% of their effort on algorithms, 20% on data and technology, and 70% on people, processes, and cultural transformation (BCG). If your transformation plan is mostly model selection, you've budgeted for the wrong thing.

For what this looks like at scale, look at DBS — Southeast Asia's largest bank, sixty years old, in my exact domain. It walked the literal sequence: the ADA data platform and PURE governance framework as the foundation since 2019, then the assist layer on top (DBS-GPT for staff, the DBS Joy customer chatbot). And the value curve, measured against control groups rather than projected, compounds the way a foundation-first build should: S$180 million in 2022, S$370 million in 2023, more than S$750 million in 2024, around S$1 billion in 2025 (Forrester; CNBC, Nov 2025). CEO Tan Su Shan named the next step in words almost identical to my own thesis: "we foresee a transition from AI as a copilot to AI operating on autopilot as we integrate agents with autonomous capabilities into workflows" (Computer Weekly).

Copilot to autopilot. And then you hit the wall everyone hits.

The wall: why the jump from Phase 2 to Phase 3 stalls everyone

This is where pilots go to die, and the reason is almost never the one people expect. The model was never the problem. The jump stalls on three things: trust, reversibility, and auditability.

Trust accrues slowly, with a human watching. Anthropic's autonomy research, drawn from 998,481 production tool calls, found 73% had a human in the loop (Anthropic, Measuring AI Agent Autonomy in Practice, Feb 2026). New operators auto-approve only about 20% of routine actions; after roughly 750 sessions with a clean track record, that rises past 40%. They don't intervene less because they stopped caring — they shift from approving every action to monitor-and-intervene, widening what runs unattended only as the record warrants.

Reversibility is the master gate. Anthropic calls it the most important input to an escalation architecture — categorise every action by how easily it can be undone. In their data only about 0.8% of actions are genuinely irreversible. But those 0.8% are the ones that need a human, and in credit they cluster around the decisions that change someone's life: declining an application, calling a default. Auto-execute the cheap, reversible stuff all day; the irreversible stuff escalates.
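In code, the gate can be as small as a lookup that fails closed. The action names and the catalogue below are hypothetical; in practice the classification would come from the process owner, not from the agent.

```python
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"      # can be undone cheaply
    IRREVERSIBLE = "irreversible"  # cannot be undone once executed


# Hypothetical action catalogue for a credit pipeline.
ACTION_CATALOGUE = {
    "draft_grade": Reversibility.REVERSIBLE,
    "request_missing_statement": Reversibility.REVERSIBLE,
    "decline_application": Reversibility.IRREVERSIBLE,
    "call_default": Reversibility.IRREVERSIBLE,
}


def route(action: str) -> str:
    """Auto-execute only actions known to be reversible; escalate the rest."""
    # Unknown actions are treated as irreversible: fail closed, not open.
    kind = ACTION_CATALOGUE.get(action, Reversibility.IRREVERSIBLE)
    return "auto_execute" if kind is Reversibility.REVERSIBLE else "escalate_to_human"


for action in ("draft_grade", "decline_application", "something_new"):
    print(action, "->", route(action))
```

The default matters more than the table: an action nobody has classified yet goes to a human.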

Auditability is where Phase 1's discipline pays its largest dividend. In regulated credit, every decision must be replayable from an immutable trace. This is the whole reason code does the math: deterministic Python is replayable evidence — you can point to the exact ratio that triggered the decision. An LLM "thinking through" the sum in prose is a story about a number, not evidence. Let the model do the arithmetic in Phase 1 and you have no audit trail in Phase 3 — discovered at the worst possible moment.
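Here is one way to make "replayable from an immutable trace" tangible: a hash-chained log where each entry covers the previous one, plus a replay that reruns the recorded step on the recorded inputs. This is an illustrative sketch. The step registry, field names, and in-memory list are assumptions, and a real system would persist the trace in write-once storage.

```python
import hashlib
import json

# Registry of deterministic steps that can be rerun during an audit.
STEP_FUNCTIONS = {
    "current_ratio": lambda i: i["assets"] / i["liabilities"],
    "working_capital": lambda i: i["assets"] - i["liabilities"],
}


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(trace: list[dict], step: str, inputs: dict) -> None:
    """Run a step and append it; the hash covers content and previous hash."""
    previous_hash = trace[-1]["hash"] if trace else "genesis"
    body = {"step": step, "inputs": inputs,
            "output": STEP_FUNCTIONS[step](inputs), "prev": previous_hash}
    trace.append({**body, "hash": _digest(body)})


def verify(trace: list[dict]) -> bool:
    """Recompute every hash; an edited or reordered entry breaks the chain."""
    previous_hash = "genesis"
    for entry in trace:
        body = {k: entry[k] for k in ("step", "inputs", "output", "prev")}
        if entry["prev"] != previous_hash or entry["hash"] != _digest(body):
            return False
        previous_hash = entry["hash"]
    return True


def replay(trace: list[dict]) -> bool:
    """Rerun each step on its recorded inputs and compare with the record."""
    return all(STEP_FUNCTIONS[e["step"]](e["inputs"]) == e["output"] for e in trace)


trace: list[dict] = []
figures = {"assets": 1_500_000, "liabilities": 600_000}
append_entry(trace, "current_ratio", figures)
append_entry(trace, "working_capital", figures)
print(verify(trace), replay(trace))  # True True

trace[0]["output"] = 3.1             # simulate tampering with a recorded number
print(verify(trace), replay(trace))  # False False
```

Replay only works because each step is a deterministic function. There is nothing equivalent to rerun for a sum the model reasoned out in prose.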

Klarna ran ahead of all three and paid for it. Its February 2024 launch was a triumph on paper: an AI assistant handling 2.3 million conversations a month — two-thirds of its customer-service chats, the equivalent work of 700 full-time agents — with a projected $40 million profit improvement (Klarna press release). Fifteen months later, the CEO told Bloomberg the AI-driven cost-cutting "has gone too far," and that "really investing in the quality of the human support is the way of the future for us" (Bloomberg, 8 May 2025). They didn't hit a model ceiling. They removed the human before the system had earned the right to operate without one.

I have a smaller, more personal version of that scar. I built an orchestrator called Gluon. It started life on a Mac mini because I kept losing track of four or five Claude Code agents across projects in tmux logs, and I needed a cockpit. Early on, before I'd built a circuit breaker into it, a single run blew through $500 in tokens. Another time, a run got stuck in retry-hell for two hours: Claude had convinced itself the error was its own fault when the system just needed a restart. Nobody was harmed; it was my own money and my own afternoon. But it taught me the lesson in the cheapest possible way: autonomy is earned, not assumed. Every safeguard I added afterward (cost controls, multi-signal completion checks, the circuit breaker) was paying down the trust debt I'd run up by skipping that step.
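For readers who haven't built one, a circuit breaker of this kind can be very small. The sketch below is not Gluon's implementation; the class, the limits, and the per-step cost figures are all invented to show the idea of a hard stop on spend and on consecutive failures.

```python
class CircuitOpen(Exception):
    """Raised when the run must stop and wait for a human."""


class CircuitBreaker:
    def __init__(self, max_cost_usd: float, max_consecutive_failures: int) -> None:
        self.max_cost_usd = max_cost_usd
        self.max_consecutive_failures = max_consecutive_failures
        self.spent_usd = 0.0
        self.consecutive_failures = 0

    def record(self, cost_usd: float, succeeded: bool) -> None:
        """Call after every agent step; raises once either limit is crossed."""
        self.spent_usd += cost_usd
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1
        if self.spent_usd > self.max_cost_usd:
            raise CircuitOpen(f"cost budget exceeded: ${self.spent_usd:.2f}")
        if self.consecutive_failures >= self.max_consecutive_failures:
            raise CircuitOpen("too many consecutive failures; stop retrying")


breaker = CircuitBreaker(max_cost_usd=5.00, max_consecutive_failures=3)
try:
    for attempt in range(10):
        # Simulate a step that keeps failing, as in a retry loop gone wrong.
        breaker.record(cost_usd=0.40, succeeded=False)
except CircuitOpen as stop:
    print(f"halted after {attempt + 1} attempts: {stop}")
```

Here the failure limit trips on the third attempt, long before the cost limit would. Either limit alone would have stopped the run.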

This is the same argument I made about why AI reviewing AI is not a review, now at operating-model scale. The human validation layer is what the wall protects, and you can only earn your way through it.

That's the wall. Part 2 walks through it — what Phase 3 actually looks like when you clear it the right way, the compliance gate that turns out to be the design, and the two gates that tell you when you're allowed to move.

---

Series Navigation

Part 1: Hitting the Wall (you are here)
Part 2: The Autonomy Gate
