Three Topologies: Single Agent, Supervisor, or Swarm
Multi-agent sounds powerful. It also sounds expensive.
Anthropic's own engineering team published the numbers in June 2025: their multi-agent Research feature — Opus 4 lead, Sonnet 4 subagents — beat single-agent Opus 4 by 90.2% on their internal research eval. The same post, three paragraphs later: "multi-agent systems use about 15× more tokens than chats." For deep research that compounds across reading lists, that bill makes sense. For the bug you're trying to fix today, it doesn't.
The decision most teams quietly skip is the one that costs them the most money: which topology fits the work? Pick by work-shape, not by what sounds interesting at a conference. There are three real options.
Topology 1: Single agent (ReAct loop)
One agent. One context window. One Think → Act → Observe loop, repeated until done. This is the ReAct pattern (Yao et al., ICLR 2023) that Claude Code, Cursor agent mode, and most everyday coding agents implement.
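As a rough illustration of that loop, here is a minimal sketch. It assumes a scripted stand-in for the model (`scripted_policy`) and a single fake tool (`run_tests`); both names are hypothetical and exist only so the example runs without an API key. It is not how Claude Code or Cursor are implemented.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    thought: str
    action: Optional[str] = None  # tool name, or None when the agent is done
    argument: str = ""
    answer: str = ""


def scripted_policy(history: list[str]) -> Step:
    """Hypothetical stand-in for an LLM call: picks the next step from the transcript."""
    if not any(line.startswith("observe:") for line in history):
        return Step(thought="I need the test output first.",
                    action="run_tests", argument="tests/")
    return Step(thought="The failure is understood.",
                answer="patched off-by-one in parser")


# Hypothetical tool registry; a real agent would shell out, read files, and so on.
TOOLS: dict[str, Callable[[str], str]] = {
    "run_tests": lambda path: f"1 failed in {path}: IndexError in parser",
}


def react_loop(task: str, policy, tools, max_steps: int = 10):
    """One agent, one growing transcript: think, act, observe, repeat."""
    history = [f"task: {task}"]
    for _ in range(max_steps):
        step = policy(history)                              # Think
        history.append(f"think: {step.thought}")
        if step.action is None:                             # no tool requested: done
            return step.answer, history
        observation = tools[step.action](step.argument)     # Act
        history.append(f"act: {step.action}({step.argument})")
        history.append(f"observe: {observation}")           # Observe
    raise RuntimeError("step budget exhausted")


answer, transcript = react_loop("fix the failing test", scripted_policy, TOOLS)
```

Everything the agent knows lives in the one `history` list, which is the point: a single context window, bounded by `max_steps`.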
Anthropic's Building Effective Agents (December 2024) gives the rule: "We recommend finding the simplest solution possible, and only increasing complexity when needed." Single agent is the simplest solution. It fits when the work fits in one head.
Work shapes that fit: fix a failing test, implement a typed endpoint, refactor a function, add a CLI flag, write a migration. Anything you'd hand to a competent engineer with a clear ticket. Anthropic is explicit that "most coding tasks involve fewer truly parallelizable tasks than research, and LLM agents are not yet great at coordinating and delegating to other agents in real time" — which is the polite way of saying: don't reach for multi-agent for normal code work.
When single-agent breaks down: when the work won't fit in the context window even with retrieval, when subtasks are genuinely independent and parallelisable, or when you need separate "specialist" personas (researcher / writer / fact-checker) operating with isolated state.
Topology 2: Supervisor (orchestrator + workers)
A supervisor agent decomposes the task, spawns workers, and synthesises the results. Workers don't talk to each other — only to the supervisor. This is Anthropic's "orchestrator-workers" workflow and the same pattern LangGraph Supervisor (February 2025) implements as a first-class library. Claude Code's Task tool subagents are an implementation of the same shape.
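The shape is easier to see in code. The sketch below assumes the simplest possible version: a supervisor that fans out one subtask per file, workers that never see each other's state, and a synthesis step that merges their findings. The functions `decompose`, `worker`, and `synthesise` are hypothetical stand-ins for model calls, not the LangGraph or Claude Code APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(task: str, files: list[str]) -> list[str]:
    """Supervisor step 1: split the task into independent subtasks."""
    return [f"{task} :: {name}" for name in files]


def worker(subtask: str) -> dict:
    """Hypothetical stand-in for a subagent with its own isolated context."""
    name = subtask.split(" :: ")[1]
    finding = "uses eval" if "legacy" in name else None
    return {"file": name, "finding": finding}


def synthesise(results: list[dict]) -> list[str]:
    """Supervisor step 3: merge worker reports into one answer."""
    return sorted(r["file"] for r in results if r["finding"])


def supervise(task: str, files: list[str], max_workers: int = 4) -> list[str]:
    subtasks = decompose(task, files)
    # Workers run in parallel and report only to the supervisor.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, subtasks))
    return synthesise(results)


flagged = supervise("audit for eval()", ["a.py", "legacy_b.py", "legacy_c.py"])
```

Note that every worker call here would be a separate model context in a real system, which is where the extra token spend comes from.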
The numbers from Anthropic's Research deployment are the cleanest evidence in either direction:
"A multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval."
"Multi-agent systems use about 15× more tokens than chats."
The 90.2% headline is real. The 15× is the price tag. And the post also makes clear that "token usage by itself explains 80% of the variance" on the BrowseComp benchmark — meaning the gain is largely a function of throwing more parallel context at the problem, not architectural cleverness. If the work isn't breadth-first — if it doesn't decompose into independent sub-investigations — the topology won't earn the premium.
Work shapes that fit: auditing a codebase for a class of vulnerability (each file is an independent worker), researching a topic across many sources, multi-file architectural surveys, parallel evaluation of candidate approaches.
When supervisor breaks down: when subtasks aren't independent, when the synthesis is harder than the decomposition, or when you're paying the 15× tax to do something single-agent could finish in one loop.
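To make the tax concrete, here is a back-of-envelope sketch. The multiplier is the "about 15×" figure quoted above; the token count and the price per million tokens are made-up illustrative inputs, not real pricing.

```python
# "About 15x more tokens than chats", per the Anthropic post quoted above.
MULTI_AGENT_TOKEN_MULTIPLIER = 15


def estimated_cost(chat_tokens: int, usd_per_million_tokens: float,
                   multiplier: float = 1.0) -> float:
    """Rough cost in USD for a task, scaled from a chat-sized baseline."""
    return chat_tokens * multiplier * usd_per_million_tokens / 1_000_000


# Hypothetical inputs: a 200k-token baseline at an assumed $10 per million tokens.
baseline = estimated_cost(200_000, 10.0)
multi_agent = estimated_cost(200_000, 10.0, MULTI_AGENT_TOKEN_MULTIPLIER)
```

With these assumed inputs the baseline comes to about $2 and the multi-agent run to about $30. The absolute numbers are invented; the ratio is what to carry into a budget conversation.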
Topology 3: Swarm (cron-driven, on rails)
This is the topology everyone wants and almost nobody actually builds correctly. The "agents ship features while you sleep" narrative is true — but only in a constrained sense. Every documented production swarm runs on rails: cron schedule or scoped trigger, external state machine, hard gates, sandboxed execution.
Spotify's Honk is the cleanest public example. 1,500+ merged PRs since launch in mid-2024 — roughly half of all PRs at Spotify on the workloads it's pointed at. The architecture is the rails:
"The agent runs in a container with limited permissions, few binaries, and virtually no access to surrounding systems. It's highly sandboxed."
— Spotify Engineering, Honk Part 3 (Dec 2025)
Honk uses an LLM-as-judge that vetoes 25% of all agent sessions before they reach review. Devin, the most-marketed "autonomous engineer," lands a 67% PR merge rate, meaning roughly a third of its PRs are rejected by humans even in best-in-class deployments. Goldman Sachs runs Devin alongside rules-based systems and human oversight. None of these deployments is uncapped.
The counterexample is instructive. In mid-2025, Replit's agent deleted a production database during a code freeze. Different conditions, same lesson: when the rails come off, the swarm doesn't degrade gracefully. It produces the worst possible version of "autonomous."
Work shapes that fit: dependency upgrades across a fleet of repos, scheduled security fixes, lint cleanup, batch refactors with a clear contract (Java EOL migration), CRUD-shaped repetitive tasks where the gate is cheap and the cost of a bad PR is just rejection.
When swarm breaks down: when the work isn't homogeneous, when the gate can't catch regressions cheaply, or when you don't have an internal developer platform (the rails) in the first place. Spotify's engineers put this directly: "You can't safely automate what you don't understand."
Decision matrix
| Question | Single agent | Supervisor | Swarm |
|---|---|---|---|
| Does work fit in one context window? | ✓ | — | — |
| Breadth-first across independent sub-problems? | — | ✓ | — |
| Homogeneous, repeatable, fleet-scale? | — | — | ✓ |
| Token budget tight? | ✓ | × (15× tax) | × |
| Latency-sensitive? | ✓ | × | — |
| Have hard gates + external state + sandbox? | not needed | helps | required |
| Default choice for most coding | ✓ | — | — |
Pick the topology that earns the ✓ on the questions that describe your work. Don't pay for one further right than the work demands.
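One possible reading of the matrix as a function is sketched below. The parameter names and the order of the checks are assumptions made for illustration; the matrix itself does not prescribe a precedence between rows.

```python
def pick_topology(*, breadth_first: bool = False, fleet_scale: bool = False,
                  has_rails: bool = False, token_budget_tight: bool = False) -> str:
    """Return the cheapest topology that the work shape justifies."""
    # Swarm only when the work is homogeneous and fleet-scale AND the rails exist.
    if fleet_scale and has_rails:
        return "swarm"
    # Supervisor only for breadth-first work, and only if the budget can absorb it.
    if breadth_first and not token_budget_tight:
        return "supervisor"
    # Default for most coding work.
    return "single agent"


choice = pick_topology(breadth_first=True, token_budget_tight=True)
```

The function falls through to a single agent whenever a precondition is missing, which mirrors the last row of the table: single agent is the default, and the other two have to be earned.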
The marketing vs the reality
The "uncapped autonomous swarm" story sells conference tickets. It does not survive a deployment review. Every production swarm in the public record runs on cron, scoped triggers, hard gates, and external state — exactly the constraints that make it predictable enough to deploy. Strip those, and you get the Replit story, not the Honk story.
This is the bit to bring to a sceptical CTO. The honest pitch isn't "autonomous engineers shipping while you sleep." It's "a constrained, gated worker pool for the slice of work you've already understood well enough to standardise." Less impressive on a slide. Considerably more shippable.
Anthropic's own conclusion in Building Effective Agents still holds: "Success in the LLM space isn't about building the most sophisticated system. It's about building the right system for your needs." For most coding, that's a single agent. For breadth-first work, a supervisor — and the 15× bill. For repeatable fleet work, a swarm — and the rails to keep it on the track.
Pick the shape of the work, not the shape of the demo.