Tools, Then Teammates, Then Autonomy: A Field Guide to Turning a Brownfield Company AI-Native
$ grep -n "^##" 2026-06-tools-teammates-autonomy.md
- 10:The lie that sells, and the wall that proves it a lie
- 22:Two axes, not one ladder
- 50:Phase 1 — Tools: codify it so a human or an agent can use it
- 86:Phase 2 — Teammates: the agent assists, the human still ships
- 98:The wall: why the jump from Phase 2 to Phase 3 stalls everyone
- 116:Phase 3 — Autonomy: a network of specialists, with a gate
- 157:Is this normal? Yes — and you're on the faster track
- 169:The playbook: Step 0 to Step 3, with the gates between
- 185:The dynamo, the desk, and the order you can finally walk
Right now, in the company I'm working with, a person opens a client's financial statements, reads down the P&L and the balance sheet, and types the numbers into a spreadsheet. The spreadsheet has the credit model baked into its cells — ratios, weightings, thresholds. Type in the figures and it spits out a grade. It works. It has worked for years. It is also slow, it is error-prone in the way any retyping is error-prone, and the most expensive part of the whole exercise is a senior analyst doing data entry.
I'm replacing the typing. Not the judgment — the typing. A code-execution agent reads the statements, decides which line maps to which model input, and writes Pandas to extract and aggregate the figures. The arithmetic runs as deterministic code in a sandbox, logged and replayable, so every number that lands in the grading step can be traced back to the exact line of code that produced it. The language model never does the sum. It decides what to compute; Python does the computing. That distinction is the whole game, and I'll come back to why.
I've been carrying process debt for twenty years. I ran my first production systems on a physical server racked in a London Docklands colocation facility, before AWS existed, and I now run AI-powered teams from a desk in Singapore. Brownfield isn't a case study to me. It's my entire career. So when I tell you that turning an established company AI-native is an ordered path — tools, then teammates, then autonomy — and that you physically cannot reorder it, I'm not theorising. I'm mid-journey on exactly this, and I'm telling you what I can see from here.
Everyone wants to skip to the end of this story. Here's what they're skipping past.
The lie that sells, and the wall that proves it a lie
The pitch you've heard goes like this: buy the agent platform, point it at your work, and you're AI-native. The unit of transformation is a product. Autonomy is a licence you purchase.
It's seductive because demos reward flash. An autonomous agent booking travel or closing the books in a slide deck is a better show than a typed function that extracts a number from a PDF. So the budget flows to the show.
Then the show meets the operation, and the gap is enormous. McKinsey's 2025 State of AI survey is the cleanest single statement of it: 62% of organisations are at least experimenting with AI agents, but only 21% have redesigned any workflow around them (McKinsey, The State of AI 2025). That 41-point gap is not a model-capability gap. The models are extraordinary. It's the distance between bolting an agent on and actually changing how the work is done. By one widely-cited and much-debated estimate, 95% of enterprise generative-AI pilots produce no measurable P&L impact (MIT Project NANDA, via Fortune) — the methodology is contested, so treat it as direction, not gospel. But the direction is unmistakable, and it lines up with everything I've watched first-hand: pilots don't die because the model is too weak. They die at the same wall, and the wall is the unglamorous first phase that the autopilot pitch told you to skip.
Here's the villain, named plainly: AI is an amplifier, not a fixer. Point it at a clean process and you scale a clean process. Point it at a broken one and you scale the breakage — faster, more consistently, and now embedded in software where it's harder to fix. A human running a bad step will quietly deviate to get the right outcome. An agent will execute the bad step faithfully, ten thousand times a day. Michael Hammer said it in Harvard Business Review in 1990, and it lands harder now than it did then: "It is time to stop paving the cow paths. Instead of embedding outdated processes in silicon and software, we should obliterate them and start over" (Hammer, "Reengineering Work: Don't Automate, Obliterate," Harvard Business Review, July–August 1990). Thirty-five years later, the cow paths are being paved at machine speed.
I've written before about why the prototype-to-production gap is a governance problem, not a technical one. This is the same wall from a different angle. Before the three phases make sense, you have to separate two things almost everyone collapses into one.
Two axes, not one ladder
Most maturity models are a single ladder: chatbot at the bottom, autonomous enterprise at the top, climb rung by rung as a whole company. That framing is wrong in a way that quietly causes a lot of anxiety, because it makes "are we AI-native yet?" a company-wide grade you're failing.
There are two axes, not one. Organisational maturity is how institutionalised AI is across the whole business — governance, data, culture, the boring infrastructure. Per-pipeline autonomy is how independent a single workflow is. They move at completely different speeds, and conflating them is what makes transformation feel impossible.
Once you separate them, the pressure lifts. Autonomy propagates pipeline-by-pipeline, not company-wide: you can run a fully autonomous Phase 3 agent on credit-statement extraction while every other process in the business is still at Phase 1. You don't transform a company in one move. You walk one pipeline through all three phases, harvest the reusable tools and evals it produces, then point them at the next pipeline. The maturity of the org is the slow-moving average of every pipeline's autonomy — a lagging indicator, not a thing you set.
It helps to borrow the scale the car industry already worked out. SAE grades driving automation from L0 to L5. The three phases map onto it cleanly: Phase 1 (Tools) is roughly L0–L1, the human does everything and the tool assists. Phase 2 (Teammates) is L2–L3, the agent drives and the human is the monitor and fallback. Phase 3 (Autonomy) is L4 — the agent operates independently within a defined domain, with a human for escalations and out-of-domain calls. Note what's missing: L5. Nobody serious is at full, go-anywhere autonomy, and in regulated credit you don't want to be. The honest end-state is L4 in a box, and saying so is what keeps the rest of this credible.
So let's walk one pipeline. Start at the bottom, where the real work is.
Phase 1 — Tools: codify it so a human or an agent can use it
Phase 1 is the part everyone treats as plumbing and buries in a "data readiness" checkbox. I think it's the foundation, and the most ownable claim I have is that it's a named, ROI-positive phase in its own right — not prep work, not a prerequisite you grind through to earn the fun stuff.
The work is to take the process and codify it as a typed, deterministic, verifiable function. Verifiable is the operative word: the output has to be something you can check — a ratio, an extracted figure, a grade. In the credit example, "extract total liabilities from this balance sheet" has a right answer. "Compute the current ratio from these two figures" has a right answer. Those are the things you codify first.
The sharp version of this, the line I keep coming back to: build it so a human or an agent can use it. Digitisation isn't cleaning a dataset. It's building an executable interface. You put the credit logic in one codified core, then hang two thin adapters off it — a human UI with a "compute grade" button, and an agent tool the LLM can call. Same code path. Same audit log. Two callers. When you're ready to move from Phase 2 to Phase 3, the human-to-agent handoff is a configuration change, not a rebuild, because you built for both from the start.
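Here is a minimal sketch of that one-core, two-adapter shape. Everything named in it is hypothetical: the single-ratio "model", the function names, and the in-memory audit log are stand-ins for illustration, not the actual credit model or its storage.

```python
from dataclasses import dataclass
from decimal import Decimal
import json

# Hypothetical in-memory log; a real system would persist this somewhere durable.
AUDIT_LOG: list = []


@dataclass(frozen=True)
class RatioInputs:
    current_assets: Decimal
    current_liabilities: Decimal


def current_ratio(inputs: RatioInputs, caller: str) -> Decimal:
    """Codified core: typed, deterministic, and logged the same way for every caller."""
    if inputs.current_liabilities <= 0:
        raise ValueError("current_liabilities must be positive")
    result = (inputs.current_assets / inputs.current_liabilities).quantize(Decimal("0.01"))
    AUDIT_LOG.append({
        "fn": "current_ratio",
        "caller": caller,
        "inputs": {
            "current_assets": str(inputs.current_assets),
            "current_liabilities": str(inputs.current_liabilities),
        },
        "result": str(result),
    })
    return result


def human_ui_compute(assets_field: str, liabilities_field: str) -> str:
    """Thin adapter 1: what a 'compute' button on a form would call."""
    inputs = RatioInputs(Decimal(assets_field), Decimal(liabilities_field))
    return f"Current ratio: {current_ratio(inputs, caller='human_ui')}"


def agent_tool_compute(arguments_json: str) -> str:
    """Thin adapter 2: what an LLM tool call with JSON arguments would hit."""
    args = json.loads(arguments_json)
    inputs = RatioInputs(
        Decimal(str(args["current_assets"])),
        Decimal(str(args["current_liabilities"])),
    )
    return json.dumps({"current_ratio": str(current_ratio(inputs, caller="agent_tool"))})
```

Both adapters only parse and format. Neither contains credit logic, so switching who calls the core changes the `caller` field in the log and nothing else.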
Inside that core sits the single most important technical decision in the whole pipeline, and it's why I keep insisting code does the math. Don't let the language model do the arithmetic. The frailty is well documented: even strong models degrade sharply on multi-digit multiplication as the digit count climbs, and in regulated credit a wrong, unauditable number is a compliance failure, not a bug. The fix has been known since the PAL paper in late 2022, which showed that having the model write the computation and offload it to a Python interpreter, rather than reason it out in tokens, set state-of-the-art on grade-school math: 78.7% on GSM8K against 60.1% for the best chain-of-thought baseline, a gap of about 18 points (Gao et al., arXiv:2211.10435). This isn't research any more. Anthropic, OpenAI, E2B and Vercel all ship sandboxed code execution as a first-class primitive. The model decides what to compute; the runtime computes it; you get a number you can audit.
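One way to picture the split, as a sketch rather than the pipeline's real code: the model's output is a plan naming which statement lines feed which model input, and plain Python does the sum. The statement figures, the `MODEL_PLAN` structure, and `execute_plan` are all invented for illustration. A production version would run inside a sandbox, which this snippet does not attempt to show.

```python
from decimal import Decimal

# Statement lines as they might arrive from extraction (figures are made up).
STATEMENT = {
    "Trade payables": Decimal("120500.25"),
    "Short-term borrowings": Decimal("80000.00"),
    "Long-term borrowings": Decimal("310000.75"),
    "Share capital": Decimal("200000.00"),
}

# Hypothetical model output: a plan naming WHICH lines feed WHICH input.
# The model never produces the total itself.
MODEL_PLAN = {
    "total_liabilities": [
        "Trade payables",
        "Short-term borrowings",
        "Long-term borrowings",
    ],
}


def execute_plan(plan, statement):
    """Deterministically compute each model input and keep the lines that produced it."""
    results = {}
    for target, line_names in plan.items():
        missing = [name for name in line_names if name not in statement]
        if missing:
            # Fail loudly rather than let a hallucinated line name pass silently.
            raise KeyError(f"plan references unknown lines: {missing}")
        total = sum((statement[name] for name in line_names), Decimal("0"))
        results[target] = {"value": total, "source_lines": list(line_names)}
    return results


computed = execute_plan(MODEL_PLAN, STATEMENT)
```

`Decimal` is used instead of `float` so that currency sums are exact, and each result carries the source lines that fed it, which is what makes the number traceable later.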
There's an efficiency bonus that surprised me the first time I measured something similar. When the agent keeps intermediate data inside the sandbox instead of passing every number back through the model's context, the token cost collapses. Anthropic documented one task dropping from roughly 150,000 tokens to about 2,000 — a 98.7% reduction — just by letting the code hold the data and returning only the result (Anthropic, Code execution with MCP).
And the ROI lands immediately, before any autonomy at all. Extraction that took a senior analyst the better part of an hour runs in seconds; the data going into the scoring step is cleaner and more consistent. That's real money on day one of Phase 1. It is not, however, a 10× transformation of the business, and I'd be lying to you if I dressed it up as one. The honest productivity number is closer to 1.2–1.5×, and pretending otherwise is how you set a board up to feel cheated. Phase 1 pays. It pays soberly.
Once the tool exists, the agent can pick it up. But you don't hand it the wheel yet.
Phase 2 — Teammates: the agent assists, the human still ships
Phase 2 is the least contested part of the path, so I'll move quickly. A CRM "new prospect" event fires. An agent gathers the client's financials — pulling attachments, or doing web research for a listed company — runs the Phase 1 grading tools, and surfaces a draft grade. The human reviews it, adjusts, and signs off. The human is still in the loop on every decision.
The trick that makes Phase 2 safe is shadow mode. Before the agent's output is allowed to influence anything, you run it in parallel with the human: the agent processes the same inputs, produces its grade, and the human's decision still ships. You're not deploying the agent yet. You're measuring it — agreement rate against the human, whether its confidence tracks its correctness, which categories of input it gets wrong. You build a record of trust before you spend any.
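The shadow-mode measurements can be sketched in a few lines. The log format, grades, confidence values, and category names below are all invented. Comparing mean confidence on agreements against disagreements is only a crude first look at whether confidence tracks correctness, not a proper calibration analysis.

```python
from collections import Counter

# Hypothetical shadow-mode log: the human's grade shipped; the agent's was only recorded.
SHADOW_LOG = [
    {"human": "B", "agent": "B", "confidence": 0.92, "category": "listed_company"},
    {"human": "C", "agent": "C", "confidence": 0.88, "category": "sme_pdf"},
    {"human": "B", "agent": "C", "confidence": 0.55, "category": "scanned_statement"},
    {"human": "A", "agent": "A", "confidence": 0.95, "category": "listed_company"},
    {"human": "D", "agent": "C", "confidence": 0.60, "category": "scanned_statement"},
]


def mean_confidence(rows):
    """Average agent confidence over a set of rows, or None if the set is empty."""
    return sum(r["confidence"] for r in rows) / len(rows) if rows else None


def shadow_report(log):
    """Summarise how the agent compared with the human decisions that actually shipped."""
    agree = [r for r in log if r["human"] == r["agent"]]
    disagree = [r for r in log if r["human"] != r["agent"]]
    return {
        "agreement_rate": len(agree) / len(log),
        "mean_confidence_when_agreeing": mean_confidence(agree),
        "mean_confidence_when_disagreeing": mean_confidence(disagree),
        "disagreements_by_category": dict(Counter(r["category"] for r in disagree)),
    }


report = shadow_report(SHADOW_LOG)
```

On this toy log both disagreements fall in one input category, which is the kind of pattern that tells you where the agent needs more eval cases before it is trusted with anything.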
The part that matters most here isn't the agent. It's the person. The analyst who used to type figures into the spreadsheet becomes the agent's supervisor and the author of its eval set. That's not a layoff statistic; it's a promotion in responsibility. BCG's much-cited 10-20-70 principle puts the weight exactly here: top performers spend 10% of their effort on algorithms, 20% on data and technology, and 70% on people, processes, and cultural transformation (BCG). The 70% is the people whose jobs change. If your transformation plan is mostly model selection, you've budgeted for the wrong thing.
For what this looks like at scale in a regulated incumbent, look at DBS — Southeast Asia's largest bank, sixty years old, in my exact domain. It walked the literal sequence. The foundation was the ADA data platform and the PURE governance framework (Purposeful, Unsurprising, Respectful, Explainable), in place since 2019. On top of that came the assist layer: DBS-GPT for staff and the DBS Joy customer chatbot. And the value curve, measured against control groups rather than projected, compounds the way a foundation-first build should: S$180 million in 2022, S$370 million in 2023, more than S$750 million in 2024, and around S$1 billion in 2025 (Forrester; CNBC, Nov 2025). By late 2025 DBS reported 430-plus AI use cases on 2,000-plus models. CEO Tan Su Shan named the next step in words almost identical to my own thesis: "we foresee a transition from AI as a copilot to AI operating on autopilot as we integrate agents with autonomous capabilities into workflows" (Computer Weekly).
Copilot to autopilot. And then you hit the wall everyone hits.
The wall: why the jump from Phase 2 to Phase 3 stalls everyone
This is where pilots go to die, and the reason is almost never the one people expect. The model was never the problem.
The jump from human-assisted AI to autonomous AI stalls not on model capability but on three things: trust, reversibility, and auditability. Start with trust. Anthropic's autonomy research, drawn from 998,481 production tool calls, found that 73% of them had a human in the loop (Anthropic, Measuring AI Agent Autonomy in Practice, Feb 2026). And the shape of how trust is earned is the counter-intuitive part: new operators auto-approve only about 20% of routine actions; after roughly 750 sessions with a clean track record, that rises past 40%. Operators haven't stopped caring; they shift from approving every action to a monitor-and-intervene pattern, expanding what runs unattended only as the record warrants it. The mature operating mode is monitor-and-intervene, not fire-and-forget. Trust doesn't arrive at a discrete cutover. It accrues, slowly, with a human watching.
Reversibility is the master gate. Anthropic's framing — categorise every action an agent can take by how easily it can be undone — calls reversibility the most important input to an escalation architecture. In their data only about 0.8% of actions are genuinely irreversible. But those 0.8% are precisely the ones that need a human, and in credit they cluster around the decisions that change someone's life: declining an application, calling a default. You can auto-execute the cheap, reversible stuff all day. The irreversible stuff escalates.
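A reversibility map can be as plain as a lookup table plus a routing rule. This is a sketch under stated assumptions: the action names, the three tiers, and the middle "execute then review" route are mine, chosen for illustration, and are not taken from Anthropic's framing or from any regulator.

```python
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"        # can be undone cheaply
    COSTLY = "costly_to_reverse"     # can be undone, with effort
    IRREVERSIBLE = "irreversible"    # cannot be undone in practice


# Hypothetical reversibility map for a credit pipeline.
REVERSIBILITY_MAP = {
    "extract_figures": Reversibility.REVERSIBLE,
    "draft_grade": Reversibility.REVERSIBLE,
    "update_crm_record": Reversibility.COSTLY,
    "decline_application": Reversibility.IRREVERSIBLE,
    "call_default": Reversibility.IRREVERSIBLE,
}


def route(action: str) -> str:
    """Return how an action is handled. Unmapped actions escalate by default."""
    level = REVERSIBILITY_MAP.get(action)
    if level is Reversibility.REVERSIBLE:
        return "auto_execute"
    if level is Reversibility.COSTLY:
        return "execute_then_review"
    # Irreversible actions and anything missing from the map go to a human.
    return "escalate_to_human"
```

The default matters more than the table: an action nobody classified is treated as irreversible, so a gap in the map fails safe instead of failing silently.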
Auditability is where Phase 1's discipline finally pays its largest dividend. In regulated credit, every decision must be replayable from an immutable trace. This is the whole reason code does the math: deterministic Python is replayable evidence — you can point to the exact ratio that triggered the decision. An LLM "thinking through" the sum in prose is not evidence. It's a story about a number. If you let the model do the arithmetic in Phase 1, you have no audit trail in Phase 3, and you discover that at the worst possible moment.
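To make "replayable from a trace" concrete, here is a small hash-chained log using only the standard library. It is a sketch, and the step names and fields are hypothetical. Note what it does and does not give you: the chain makes tampering detectable, but real immutability still depends on where the trace is stored.

```python
import hashlib
import json

FIELDS = ("step", "inputs", "output", "prev_hash")


def _digest(body: dict) -> str:
    """Stable SHA-256 over a record's content, independent of key order."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(trace: list, step: str, inputs: dict, output: str) -> None:
    """Append a record whose hash covers its content and the previous record's hash."""
    prev_hash = trace[-1]["hash"] if trace else "genesis"
    body = {"step": step, "inputs": inputs, "output": output, "prev_hash": prev_hash}
    trace.append({**body, "hash": _digest(body)})


def verify(trace: list) -> bool:
    """Recompute every hash; an edited or reordered record breaks the chain."""
    prev_hash = "genesis"
    for entry in trace:
        body = {key: entry[key] for key in FIELDS}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each step's inputs and output are recorded as exact strings, re-running the deterministic function on the logged inputs should reproduce the logged output, which is the replay property prose reasoning cannot offer.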
The cautionary tales make the wall visceral. Klarna ran ahead of all three and paid for it. Its February 2024 launch was a genuine triumph on paper: an AI assistant handling 2.3 million conversations a month — two-thirds of its customer-service chats, the equivalent work of 700 full-time agents — with a projected $40 million profit improvement (Klarna press release). Fifteen months later, the CEO told Bloomberg the AI-driven cost-cutting "has gone too far," and that "really investing in the quality of the human support is the way of the future for us" (Bloomberg, 8 May 2025). They didn't hit a model ceiling. They removed the human before the system had earned the right to operate without one.
I have a smaller, more personal version of that scar. I built an orchestrator called Gluon — it started life on a Mac mini because I kept losing track of four or five Claude Code agents across projects in tmux logs, and I needed a cockpit. Early on, before I'd built a circuit breaker into it, a single run blew through $500 in tokens. Another time, one got stuck in retry-hell for two hours — Claude had convinced itself the error was its own fault when the system just needed a restart. Nobody was harmed; it was my own money and my own afternoon. But it taught me the lesson in the cheapest possible way: autonomy is earned, not assumed. Every safeguard I added afterward — cost controls, multi-signal completion checks, the circuit breaker — was me paying down the trust I'd skipped.
This is the same argument I made about why AI reviewing AI is not a review, now at operating-model scale. The human validation layer is the thing the wall is built to protect, and you cannot automate your way around it. You can only earn your way through it.
I should be honest about where I'm standing. I am not at Phase 3 on this credit pipeline. I'm mid-Phase-2, and I'm telling you what the wall looks like from the bottom of it — not from the far side. That vantage is the point: clear it the right way and Phase 3 stops looking like science fiction. It starts looking like a factory.
Phase 3 — Autonomy: a network of specialists, with a gate
The destination is not one giant do-everything agent. It's an assembly line. A supervisor agent owns the pipeline — origination, client comms, CRM updates, document extraction, ratio computation, grading — and spawns specialist workers for each stage. You parallelise within a stage where it's safe (extract twelve documents at once) and keep the sequence explicit across stages. This is the factory logic of the last two centuries — division of labour, interchangeable parts, controlled flow — applied to knowledge work.
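The "parallel within a stage, sequential across stages" rule looks roughly like this in code. The worker functions are placeholder stubs, not agents; in a real pipeline each would call a specialist. The thread pool and the worker count of four are arbitrary choices for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_document(doc_id: str) -> dict:
    """Stand-in specialist worker; a real one would call an extraction agent."""
    return {"doc_id": doc_id, "status": "extracted"}


def compute_ratios(extractions: list) -> dict:
    """Stand-in for the ratio stage; consumes every extraction from the stage before."""
    return {"documents_used": len(extractions)}


def grade(ratios: dict) -> dict:
    """Stand-in for the grading stage; produces a draft, never a final decision."""
    return {"grade": "DRAFT", "based_on": ratios}


def run_pipeline(doc_ids: list) -> dict:
    # Stage 1: fan out. Documents are independent, so extraction can run in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        extractions = list(pool.map(extract_document, doc_ids))  # map keeps input order
    # Stages 2 and 3: strictly sequential, each consuming the previous stage's output.
    ratios = compute_ratios(extractions)
    return grade(ratios)
```

The supervisor here is just `run_pipeline`: it owns the order of stages, and the only concurrency it allows is inside the one stage where the units of work do not depend on each other.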
I know this mode is real because it's already what my desk looks like. Gluon supervises a handful of specialist agents; I'm the human watching the cockpit, not the one doing each task. McKinsey describes the same shape at scale: a human team of two to five people can already supervise an "agent factory" of 50 to 100 specialised agents running an end-to-end process (McKinsey, The agentic organization). The unit you supervise stops being a task and becomes a workflow.
Now temper the hype, because this is where it gets sold hardest. Anthropic's multi-agent research system beat single-agent Claude by about 90% on its own evaluation — but it cost roughly 15× the tokens, and the win only showed up on high-value, parallelisable work (Anthropic, How we built our multi-agent research system). A credit pipeline is mostly sequential; you can't fork origination and final decision. So the multi-agent win is narrow, and the hard part isn't the orchestration framework at all. The hard part is evals, observability, and failure recovery. The gap between a good agent system and a bad one is almost never the framework. Frameworks are a weekend. Evals are the job.
And the human escalation gate — the thing the autopilot pitch treats as a hedge — is actually the compliance design. Three regulators, one convergent demand: a machine can grade the credit, but a human must own the adverse decision. Singapore says it as a principle: MAS's FEAT framework already requires "human review in high-impact decisions" and "clear intervention points and escalation procedures" (MAS FEAT) — the regulator wrote my Phase 3 caveat eight years before I did. The EU says it as a binding control: credit scoring is high-risk under Annex III of the AI Act, and Article 14 human-oversight obligations come into force on 2 August 2026 (EU AI Act, Annex III). The US says it as a litigation outcome: under ECOA, the CFPB has made clear a creditor can't hide behind a "black-box" model — every applicant has the right to specific reasons for an adverse action (CFPB guidance, Sep 2023). There's even a neat closing of the loop here: MAS's Veritas initiative ran its banking fairness pilots on credit-risk scoring with UOB — my exact worked example was an official regulator test case.
So the escalation gate isn't caution bolted on at the end. It's the design that satisfies all three jurisdictions at once. Draw it as a feature, not a dead-end.
Which raises the question I actually started with. Is the messy, iterative way I'm doing this normal?
Is this normal? Yes — and you're on the faster track
I'll give you the verdict up front, because it's the reason this post exists: iterating is universal, the sequence is what survivors converge on, and naming it up front is the rare part. The brownfield AI-native playbook didn't exist until practitioners started writing it. By most accounts agentic AI was barely deployed in enterprise as recently as 2024; the field is about eighteen months old at production scale. There is no decade of accumulated practice to copy. We're writing it now.
The strongest empirical backbone for this is Stanford's Enterprise AI Playbook, drawn from 51 production deployments across 41 organisations. Three findings carry the whole argument. 77% of the hardest challenges were intangible — change management, data quality, process redesign. That's Phase 1, stated as an empirical result. 61% of successful projects had at least one prior failure, whose costs never show up in the final ROI. And the line they put their name to, verbatim: "The difference was never the AI model. It was always the organization" (Stanford Digital Economy Lab, Enterprise AI Playbook).
Read that again, because it's the permission slip a lot of teams need. The difference was never the model. So if your pilots have stalled, the model isn't your problem and a better model won't be your fix.
The teams stuck in pilot purgatory aren't slower iterators. They're iterating in the wrong order — automating a broken process, asking the LLM to do the math, deploying with no evals, over-scoping straight to autonomy. They pay the same lessons twice, expensively. And the order isn't a safety tax; it's where the return lives. Stanford found that deployments with escalation-based oversight — the agent handles 80%-plus autonomously and humans review the exceptions — delivered a 71% median productivity gain, versus 30% for "review everything" approval models (the authors note this may partly reflect different task types). The autonomy ladder, walked properly, is the ROI.
If you're a CTO or a head of transformation reading this with a working demo and a gap to production you can't seem to close, you're not behind. Your iterative path is normal, because the playbook didn't exist until people like us wrote it. You're on the faster track if you codify before you automate, use code for the math, and earn autonomy instead of assuming it. So here's the order, with the gates that tell you when to move, so you can skip the part where I learned it the hard way.
The playbook: Step 0 to Step 3, with the gates between
Step 0 — Pick the first process. Score candidates on six dimensions every credible framework converges on: value times frequency, structuredness and verifiability, tolerance for error, reversibility, data quality, and reusability. The rule: start with a high-frequency, structured, low-stakes, reversible process whose output you can verify deterministically. OpenAI's own guidance points the same way — for deterministic workflows where you can cleanly enumerate every step and exception, a scripted tool is often more efficient and easier to maintain than an agent (OpenAI, A Practical Guide to Building Agents). In credit, that means extraction and ratio computation go first; the final credit decision — high-stakes, regulated, less reversible — goes last.
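The Step 0 scoring can be done on paper, but a tiny script keeps it honest. In this sketch the 1-to-5 scores are invented, the dimension keys are my shorthand for the six dimensions named above, and the equal weighting is an assumption rather than a recommendation.

```python
DIMENSIONS = (
    "value_x_frequency",
    "structured_and_verifiable",
    "error_tolerance",
    "reversibility",
    "data_quality",
    "reusability",
)

# Illustrative 1-5 scores, invented for this sketch; higher means a better first candidate.
CANDIDATES = {
    "statement_extraction": dict(zip(DIMENSIONS, (5, 5, 4, 5, 3, 5))),
    "ratio_computation": dict(zip(DIMENSIONS, (4, 5, 4, 5, 4, 4))),
    "final_credit_decision": dict(zip(DIMENSIONS, (5, 2, 1, 1, 3, 2))),
}


def rank(candidates: dict) -> list:
    """Rank candidates by unweighted total score, best first."""
    totals = {
        name: sum(scores[dim] for dim in DIMENSIONS)
        for name, scores in candidates.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank(CANDIDATES)
```

With these made-up scores the final credit decision ranks last even though its value score is high, because it loses on verifiability, error tolerance, and reversibility.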
Step 1 — Build the tools (Phase 1). Codify the process as a typed, deterministic, verifiable function. Run all math in a sandboxed runtime — the LLM decides what to compute, Python does the arithmetic. Package it as a reusable Agent Skill or MCP tool. Put two thin adapters on the one core: a human UI and an agent tool. Write the eval set — including the compliance cases — and wire it into CI. Turn tracing on from day one.
Gate A (1 to 2): verifiable output, an eval baseline living in CI, digitised and clean inputs, and every call replayable from a trace.
Step 2 — Teammates (Phase 2). The human leads; the agent assists through the same tool, behind human-in-the-loop approvals. Run it in shadow mode beside the human and track agreement rate, confidence calibration, and error categories. Spend the budget per 10-20-70 — most of it on the people whose role is changing.
Gate B (2 to 3) — and treat these numbers as a practitioner reference template, not lab gospel: agreement around 85–90% and rising; zero policy-violating critical errors in the last several hundred decisions; confidence correlation of roughly 0.7 or better; override rate stable or declining; system invariants defined ("never auto-approve above $X"); a complete reversibility map; and — non-negotiable — risk and compliance own the threshold, not engineering. Then roll out by canary, not big-bang: 5%, then 10%, then 25%, then 100%.
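Gate B and the canary rollout can both be written down as code, which is also a way of making risk and compliance sign off on something explicit. This is a sketch: the metric names and the dictionary shape are hypothetical, 0.85 is taken as the floor of the 85–90% agreement range above, and hashing case ids into buckets is one common way to do a canary split, not the only one.

```python
import hashlib

# Thresholds from the reference template above; the owner of these is risk and compliance.
MIN_AGREEMENT = 0.85
MIN_CONFIDENCE_CORRELATION = 0.7
CANARY_STAGES = (5, 10, 25, 100)  # percent of cases routed to the autonomous path


def gate_b_failures(m: dict) -> list:
    """Return the unmet conditions; an empty list means the gate is open."""
    checks = {
        "agreement below threshold": m["agreement_rate"] >= MIN_AGREEMENT,
        "critical errors present": m["critical_errors"] == 0,
        "confidence poorly correlated": m["confidence_correlation"] >= MIN_CONFIDENCE_CORRELATION,
        "override rate rising": m["override_rate_trend"] <= 0,
        "invariants undefined": m["invariants_defined"],
        "reversibility map incomplete": m["reversibility_map_complete"],
        "threshold not owned by risk/compliance": m["threshold_owner"] == "risk_and_compliance",
    }
    return [name for name, passed in checks.items() if not passed]


def in_canary(case_id: str, percent: int) -> bool:
    """Deterministically assign a case to the canary by hashing its id into 100 buckets."""
    bucket = int(hashlib.sha256(case_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < percent
```

Hashing the case id means a case that was in the 5% canary stays in at 10% and 25%, so widening the rollout only ever adds cases and never flips one back and forth between paths.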
Step 3 — Autonomy (Phase 3). A supervisor agent owns the pipeline and spawns specialist workers. Humans drop to monitor-and-intervene; irreversible and high-stakes actions still escalate to the gate. A team of two to five supervises the factory. And then you do the thing that closes the loop from the top of this post: you point the harvested tools, skills, and evals at the next pipeline and start again. Capability propagates workflow-by-workflow.
That's the spine. Now drop the checklist voice, because the last thing I want to leave you with isn't a checklist.
The dynamo, the desk, and the order you can finally walk
Andrew Ng said AI is the new electricity. He's right — and that's exactly the warning, not the promise.
When factories first electrified, owners did the obvious thing: they ripped out the central steam engine and dropped in one big electric dynamo, keeping the same overhead shafts and belts. Productivity barely moved for about thirty years. The gains only arrived when factories were redesigned around the new reality — a small motor on every machine, the floor laid out around the flow of work instead of around the power source (Paul David, "The Dynamo and the Computer," 1990). The electricity was never the bottleneck. The floor plan was.
Bolting an agent onto an un-codified process is the dynamo-where-the-steam-engine-sat move. You've changed the power source and kept the belts. Phase 1 — codifying the process so a human or an agent can run it, deterministically and auditably — is redesigning the floor. There's no version of the gain that skips it, because the gain was always going to come from the redesign, not the wiring.
I'm still mid-Phase-2 on the credit pipeline. The human still signs off on every grade, and that's correct. But the spreadsheet's days as the place a senior analyst does data entry are numbered, and the path from here to a supervised agent network is no longer a mystery — it's an order, and I can see every gate on it from where I'm standing.
The greenfield crowd likes to say incumbents have hostages, not customers — that the legacy company is the one with everything to lose. I've been the incumbent for twenty years, and I think they have it backwards. The brownfield company owns the proprietary process, the data nobody else has, and the regulatory standing a two-year-old startup can't fake. The process debt was never the disqualification. It was the moat — locked, until now, behind work nobody wanted to do. The incumbents who quietly walk the order, one pipeline at a time, will out-build the ones who bought the autopilot. Twenty years of process debt, and the way out of it turns out to be the same thing it always was for the factory floor: not a faster engine, but a redesign you can finally afford to do.