Tools, Then Teammates, Then Autonomy — Part 2: The Autonomy Gate
Part 2 of 2.
Part 1 ended at the wall — the place pilots go to die, where the jump from Phase 2 to Phase 3 stalls on trust, reversibility, and auditability, not on the model. I closed it with a personal scar: an orchestrator I built that blew through $500 in tokens before I'd added a circuit breaker, and the lesson it taught me in the cheapest possible way — autonomy is earned, not assumed.
I should be honest about where I'm standing. I'm mid-Phase-2 on this credit pipeline, telling you what the wall looks like from the bottom of it — not from the far side. Clear it the right way and Phase 3 stops looking like science fiction. It starts looking like a factory.
Phase 3 — Autonomy: a network of specialists, with a gate
The destination is not one giant do-everything agent. It's an assembly line. A supervisor agent owns the pipeline — origination, client comms, CRM updates, document extraction, ratio computation, grading — and spawns specialist workers for each stage. You parallelise within a stage where it's safe (extract twelve documents at once) and keep the sequence explicit across stages. This is the factory logic of the last two centuries — division of labour, interchangeable parts, controlled flow — applied to knowledge work.
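A minimal sketch of that shape, assuming Python's `asyncio` as the concurrency layer. The stage functions, the `revenue` field, and the grading cut-off are hypothetical stand-ins, not the real pipeline: the point is only that work fans out inside the extraction stage while the stages themselves run in a fixed order.

```python
import asyncio

# Hypothetical worker: stands in for a specialist agent call.
async def extract_document(doc_id: str) -> dict:
    await asyncio.sleep(0)  # placeholder for real I/O
    return {"doc_id": doc_id, "fields": {"revenue": 100.0}}

async def extraction_stage(doc_ids: list[str]) -> list[dict]:
    # Parallel within a stage: each document is independent of the others.
    return await asyncio.gather(*(extract_document(d) for d in doc_ids))

async def ratio_stage(extracted: list[dict]) -> dict:
    # Plain arithmetic in code, no model involved.
    total = sum(d["fields"]["revenue"] for d in extracted)
    return {"total_revenue": total}

async def grading_stage(ratios: dict) -> dict:
    # Illustrative cut-off only; a real grading policy would live elsewhere.
    grade = "A" if ratios["total_revenue"] >= 1000 else "B"
    return {"grade": grade, **ratios}

async def supervisor(doc_ids: list[str]) -> dict:
    # Sequential across stages: each stage consumes the previous stage's output.
    extracted = await extraction_stage(doc_ids)
    ratios = await ratio_stage(extracted)
    return await grading_stage(ratios)

result = asyncio.run(supervisor([f"doc-{i}" for i in range(12)]))
```

The supervisor never reorders stages; the only concurrency is the `gather` call inside extraction, which is the one place the text calls safe.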
I know this mode is real because it's already what my desk looks like. Gluon supervises a handful of specialist agents; I watch the cockpit, I don't do each task. McKinsey describes the same shape at scale: a human team of two to five can already supervise an "agent factory" of 50 to 100 specialised agents running an end-to-end process (McKinsey, The agentic organization). The unit you supervise stops being a task and becomes a workflow.
Temper the hype, because this is where it gets sold hardest. Anthropic's multi-agent research system beat single-agent Claude by about 90% on its own internal evaluation, but multi-agent runs used roughly 15× the tokens of a chat interaction, and the win only showed up on high-value, parallelisable work (Anthropic, How we built our multi-agent research system). A credit pipeline is mostly sequential; you can't run origination and the final decision in parallel. So the multi-agent win is narrow, and the hard part isn't the orchestration framework. It's evals, observability, and failure recovery. Frameworks are a weekend. Evals are the job.
And the human escalation gate — the thing the autopilot pitch treats as a hedge — is the compliance design. Three regulators, one convergent demand: a machine can grade the credit, but a human must own the adverse decision. Singapore says it as a principle: MAS's FEAT framework already requires "human review in high-impact decisions" and "clear intervention points and escalation procedures" (MAS FEAT) — the regulator wrote my Phase 3 caveat eight years before I did. The EU says it as a binding control: credit scoring is high-risk under Annex III of the AI Act, and Article 14 human-oversight obligations come into force on 2 August 2026 (EU AI Act, Annex III). The US says it as a litigation outcome: under ECOA, the CFPB has made clear a creditor can't hide behind a "black-box" model — every applicant has the right to specific reasons for an adverse action (CFPB guidance, Sep 2023). There's even a neat closing of the loop: MAS's Veritas initiative ran its banking fairness pilots on credit-risk scoring with UOB — my exact worked example was an official regulator test case.
So the escalation gate isn't caution bolted on at the end. It's the design that satisfies all three jurisdictions at once.
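As a rough illustration of the gate itself, here is a sketch in Python. The `Grade` type and `route` function are hypothetical names, and letting non-adverse recommendations pass automatically is an assumption made for the example (it is not how my pipeline runs today). What it encodes is the convergent demand: an adverse recommendation always goes to a human, and it must carry specific reasons.

```python
from dataclasses import dataclass, field

@dataclass
class Grade:
    applicant_id: str
    recommendation: str  # "approve" or "decline" (hypothetical labels)
    reasons: list[str] = field(default_factory=list)

def route(grade: Grade) -> str:
    """Return who owns the decision: 'human' or 'auto'."""
    if grade.recommendation == "decline":
        # An adverse action needs specific reasons the applicant can be given.
        if not grade.reasons:
            raise ValueError("adverse recommendation needs specific reasons")
        return "human"
    # Assumption for this sketch: non-adverse outcomes may proceed automatically.
    return "auto"
```

For example, `route(Grade("a2", "decline", ["debt service coverage below policy minimum"]))` returns `"human"`, while a decline with no reasons raises before it can reach anyone.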
Which raises the question I actually started with. Is the messy, iterative way I'm doing this normal?
Is this normal? Yes — and you're on the faster track
Iterating is universal; the sequence is what survivors converge on; naming it up front is the rare part. The brownfield AI-native playbook didn't exist until practitioners started writing it — agentic AI was barely deployed in enterprise as recently as 2024, the field is about eighteen months old at production scale, and there's no decade of accumulated practice to copy. We're writing it now.
The strongest empirical backbone is Stanford's Enterprise AI Playbook, drawn from 51 production deployments across 41 organisations. 77% of the hardest challenges were intangible — change management, data quality, process redesign: Phase 1, stated as an empirical result. 61% of successful projects had at least one prior failure, whose costs never show up in the final ROI. And the line they put their name to: "The difference was never the AI model. It was always the organization" (Stanford Digital Economy Lab, Enterprise AI Playbook). If your pilots have stalled, a better model won't be your fix.
The teams stuck in pilot purgatory aren't slower iterators; they're iterating in the wrong order — automating a broken process, asking the LLM to do the math, deploying with no evals, over-scoping straight to autonomy. And the order isn't a safety tax; it's where the return lives. Stanford found deployments with escalation-based oversight — the agent handles 80%-plus autonomously, humans review the exceptions — delivered a 71% median productivity gain versus 30% for "review everything" models (the authors note this may partly reflect different task types). The autonomy ladder, walked properly, is the ROI.
So you're not behind. You're on the faster track if you codify before you automate, use code for the math, and earn autonomy instead of assuming it. The order is in the phases above; what the phases don't tell you is when you're allowed to move. Two gates.
Pick the first process by scoring candidates on the six dimensions every credible framework converges on: value times frequency, structuredness and verifiability, tolerance for error, reversibility, data quality, reusability. Start high-frequency, structured, low-stakes, reversible, deterministically verifiable. OpenAI's own guidance agrees — for deterministic workflows you can cleanly enumerate, a scripted tool is often more efficient and easier to maintain than an agent (OpenAI, A Practical Guide to Building Agents). In credit, extraction and ratio computation go first; the final credit decision goes last.
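One way to make that scoring concrete, sketched in Python. The 1-to-5 scale, the unweighted sum, and every number below are illustrative assumptions, not measurements; the dimension names are just the six from the paragraph above turned into identifiers.

```python
DIMENSIONS = (
    "value_x_frequency",
    "structured_verifiable",
    "error_tolerance",
    "reversibility",
    "data_quality",
    "reusability",
)

def score(candidate: dict[str, int]) -> int:
    """Unweighted sum of 1-5 ratings; higher means a better first candidate."""
    missing = set(DIMENSIONS) - candidate.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(candidate[d] for d in DIMENSIONS)

# Illustrative ratings only.
candidates = {
    "document_extraction": {
        "value_x_frequency": 4, "structured_verifiable": 5, "error_tolerance": 4,
        "reversibility": 5, "data_quality": 3, "reusability": 5,
    },
    "final_credit_decision": {
        "value_x_frequency": 5, "structured_verifiable": 2, "error_tolerance": 1,
        "reversibility": 1, "data_quality": 3, "reusability": 2,
    },
}

ranked = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
```

With these made-up ratings extraction ranks first and the final decision last, which matches the ordering argued for above. A weighted sum would be a reasonable variation if one dimension, such as reversibility, should dominate.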
Gate A (Phase 1 → 2): verifiable output, an eval baseline living in CI, clean digitised inputs, every call replayable from a trace.
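A small sketch of what "an eval baseline living in CI" can look like, assuming a pytest-style test function and a deterministic ratio computed in code. The eval cases, the `BASELINE_PASS_RATE` value, and the function names are hypothetical; the sketch covers only the baseline check, not trace replay.

```python
def current_ratio(current_assets: float, current_liabilities: float) -> float:
    """Deterministic math in code, rounded to two decimals."""
    if current_liabilities <= 0:
        raise ValueError("current liabilities must be positive")
    return round(current_assets / current_liabilities, 2)

# Hypothetical eval set: inputs paired with hand-verified expected outputs.
EVAL_CASES = [
    {"inputs": (200.0, 100.0), "expected": 2.0},
    {"inputs": (150.0, 60.0), "expected": 2.5},
    {"inputs": (90.0, 120.0), "expected": 0.75},
]

# Assumed baseline recorded from the last accepted run.
BASELINE_PASS_RATE = 1.0

def pass_rate(cases: list[dict]) -> float:
    passed = sum(1 for c in cases if current_ratio(*c["inputs"]) == c["expected"])
    return passed / len(cases)

def test_no_regression_against_baseline() -> None:
    # CI fails the build if the pass rate drops below the stored baseline.
    assert pass_rate(EVAL_CASES) >= BASELINE_PASS_RATE
```

The same pattern extends to agent outputs: swap `current_ratio` for the call under test and keep the baseline comparison unchanged.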
Gate B (Phase 2 → 3), a practitioner reference template rather than lab gospel: agreement between the agent's grades and the human reviewer's around 85–90% and rising; zero policy-violating critical errors in the last several hundred decisions; a correlation of roughly 0.7 or better between the agent's stated confidence and its actual correctness; a human override rate that is stable or declining; system invariants defined ("never auto-approve above $X"); a complete reversibility map; and, non-negotiably, a threshold owned by risk and compliance, not engineering. Then roll out by canary: 5%, 10%, 25%, 100%. Clear the second gate and you point the harvested tools, skills, and evals at the next pipeline and start again.
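The Gate B checklist and the canary steps can be sketched as a single check in Python. The thresholds come from the template above; the field names, the trend encoding, the `"risk_compliance"` owner label, and the choice to drop back to 0% on any failure are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class GateBMetrics:
    agreement: float               # agent vs human agreement, 0..1
    agreement_trend: float         # > 0 means rising
    critical_errors: int           # policy-violating errors in the review window
    confidence_correlation: float  # confidence vs correctness
    override_trend: float          # <= 0 means stable or declining
    invariants_defined: bool
    reversibility_map_complete: bool
    threshold_owner: str           # hypothetical label

def gate_b_failures(m: GateBMetrics) -> list[str]:
    """Return every unmet condition; an empty list means the gate is clear."""
    failures = []
    if m.agreement < 0.85:
        failures.append("agreement below 85%")
    if m.agreement_trend <= 0:
        failures.append("agreement not rising")
    if m.critical_errors > 0:
        failures.append("critical errors in window")
    if m.confidence_correlation < 0.7:
        failures.append("confidence correlation below 0.7")
    if m.override_trend > 0:
        failures.append("override rate rising")
    if not m.invariants_defined:
        failures.append("system invariants not defined")
    if not m.reversibility_map_complete:
        failures.append("reversibility map incomplete")
    if m.threshold_owner != "risk_compliance":
        failures.append("threshold not owned by risk and compliance")
    return failures

CANARY_STEPS = (0.05, 0.10, 0.25, 1.00)

def next_canary_share(current: float, failures: list[str]) -> float:
    # Assumption: any failure sends traffic back to fully human handling.
    if failures:
        return 0.0
    for step in CANARY_STEPS:
        if step > current:
            return step
    return 1.0
```

Returning the list of failures rather than a bare boolean keeps the decision auditable: whoever owns the threshold can see exactly which condition blocked the next canary step.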
The dynamo, the desk, and the order you can finally walk
Andrew Ng said AI is the new electricity. He's right — and that's exactly the warning, not the promise.
When factories first electrified, owners did the obvious thing: they ripped out the central steam engine and dropped in one big electric dynamo, keeping the same overhead shafts and belts. Productivity barely moved for about thirty years. The gains only arrived when factories were redesigned around the new reality — a small motor on every machine, the floor laid out around the flow of work instead of around the power source (Paul David, "The Dynamo and the Computer," 1990). The electricity was never the bottleneck. The floor plan was.
Bolting an agent onto an un-codified process is the dynamo-where-the-steam-engine-sat move. You've changed the power source and kept the belts. Phase 1 — codifying the process so a human or an agent can run it, deterministically and auditably — is redesigning the floor. There's no version of the gain that skips it.
I'm still mid-Phase-2 on the credit pipeline. The human still signs off on every grade, and that's correct. But the spreadsheet's days as the place a senior analyst does data entry are numbered, and the path from here to a supervised agent network is no longer a mystery — it's an order, and I can see every gate on it from here.
The greenfield crowd likes to say incumbents have hostages, not customers. I've been the incumbent for twenty years, and they have it backwards. The brownfield company owns the proprietary process, the data nobody else has, and the regulatory standing a two-year-old startup can't fake. The process debt was never the disqualification. It was the moat — locked, until now, behind work nobody wanted to do.