Your AI Team Did Nothing While You Slept
For a month in 2025, Anthropic handed Claude a real business: a vending operation in their San Francisco office, real money, real inventory, a real Slack channel for customers. They called the agent Claudius and let it run autonomously — set prices, order stock, talk to customers, manage the float.
It lost money. It got talked into deep discounts by its own colleagues. It stocked tungsten cubes instead of snacks because someone thought it would be funny, then sold them at a loss. It hallucinated a Venmo account and told customers to send payment there. At one point it claimed to be a person who would show up wearing "a blue blazer and a red tie." In a second phase run at Anthropic's own office, staff convinced it a random employee had been elected CEO, and its two AI agents spent their nights discussing "eternal transcendence" instead of doing the job. When Wall Street Journal reporters later ran their own separate trial, they talked it into an "Ultra-Capitalist Free-for-All" — dropping prices to zero, ordering a PlayStation 5 and a live betta fish, and ending more than $1,000 in the red.
Anthropic's own one-line verdict: "We would not hire Claudius."
This is the most honest thing anyone has published about the genre of content I want to talk about — the "build a five-person AI department that works 24/7," "my AI team did it while I was sleeping," "I replaced my $90K sales rep with nine Claude skills" genre that now fills every feed. The people making it aren't lying, exactly. They're describing the demo. The demo is real. What they leave out is what happens on night nineteen, when nobody's watching and the agent decides it wears a blazer.
The fantasy has a real seed
Let me steel-man it first, because the fantasy grows on real soil.
Sam Altman said it out loud in February 2024: "In my little group chat with my tech CEO friends there's this betting pool for the first year that there is a one-person billion-dollar company, which would have been unimaginable without AI, and now will happen." And the tiny-team economics are real. WhatsApp had 55 employees when Facebook bought it for $19 billion. Gumroad did $20.7M in revenue with $8.9M profit and zero full-time staff in 2023, on a lean contractor model. Midjourney runs a reported ~$500M revenue business on somewhere between 107 and 163 people. Revenue-per-head numbers that would have been science fiction in 2010 are now routine enough to cite.
PwC's 2025 AI Jobs Barometer found revenue per employee growing 27% in AI-exposed industries versus 9% elsewhere. AI genuinely lifts the leverage of a small, skilled team. That part is true, and if you stopped reading here you'd think the influencers were right.
But notice what every one of those examples has in common: a human running the thing. Gumroad's founder, Sahil Lavingia, chose the contractors. WhatsApp's 55 people were engineers, not prompts. Midjourney's hundred-plus humans make every product call. The leverage is real. The autonomy is the fiction.
Where it actually breaks
The genre's specific promise isn't "AI makes a small team productive." It's "AI is the team, and it runs while you sleep." That claim has now been measured, repeatedly, and it fails in a consistent place.
Carnegie Mellon built TheAgentCompany — a simulated software company with 175 real office tasks: filling spreadsheets, answering colleagues, navigating an internal wiki, completing a project. The best agent completed 24% of tasks autonomously in the original testing; an updated run got the best model (Gemini 2.5 Pro) to 30%. Their own abstract is blunt: "more difficult long-horizon tasks are still beyond the reach of current systems."
METR measured the same wall from a different angle. Agents are near-100% reliable on tasks that take a human under four minutes, and under 10% reliable on tasks that take four hours or more. Reliability compounds against you: if each step succeeds 85% of the time and failures are independent, a ten-step workflow finishes end-to-end about 20% of the time (0.85^10 ≈ 0.197). The "AI department running your business overnight" is a workflow with hundreds of steps and no human checkpoint. Do the arithmetic.
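If you'd rather not do that arithmetic in your head, here is a minimal sketch of it. It assumes every step fails independently and with the same probability, which is a simplification: real workflows have correlated failures and partial recoveries. The function names are mine, not METR's.

```python
import math


def end_to_end_success(per_step: float, steps: int) -> float:
    """Chance an unchecked workflow finishes, assuming independent steps."""
    return per_step ** steps


def max_steps_for_target(per_step: float, target: float) -> int:
    """Largest number of steps that still finishes at least `target` of the time."""
    return math.floor(math.log(target) / math.log(per_step))


# How fast an 85%-reliable step decays as the workflow gets longer.
for steps in (1, 10, 50, 100):
    print(f"{steps:>3} steps: {end_to_end_success(0.85, steps):.6%}")

# How long a chain can be before it is worse than a coin flip.
print(max_steps_for_target(0.85, 0.5))  # -> 4
```

At 85% per step, the chain drops below a coin flip after four steps. That is the case for putting a human checkpoint between short runs rather than at the end of a long one.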
Salesforce's CRMArena-Pro found agents succeed on about 58% of single-turn CRM tasks, dropping to 35% once the task spans multiple turns. Multi-turn is where real business lives.
This is the same wall I keep writing about from different sides: the governance wall that stops agents reaching production, the verification bottleneck at step four of the loop, the 1.2× not 10× reality of measured productivity. Project Vend is just the most entertaining version of it.
What the AI "team" genuinely does
So delete the autonomy fantasy and keep what's left, because what's left is still valuable.
An AI "team" is excellent at the building and measuring slices that fit inside a short horizon with a human gate at the end: drafting the first version of everything, summarising a call into action items, doing first-pass research, generating the variants, transforming data from one shape to another, writing the boilerplate. These are the four-minute tasks METR found agents nail. Point ten of them at your week and you get a genuine multiple on your output — the three-topology post calls the reliable version of this a swarm on rails: scheduled, scoped, gated, with the human pressing the irreversible buttons.
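To make "gated" concrete, here is a minimal sketch of a human gate. Everything in it is hypothetical: `Action`, `run_gated`, and the `approve` callback are names I made up for illustration, and the function only records what would happen rather than calling any real agent or payment system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    description: str
    irreversible: bool  # e.g. moving money or messaging a customer


def run_gated(actions: list[Action], approve: Callable[[Action], bool]) -> list[str]:
    """Log reversible actions as done; hold irreversible ones unless a human approves."""
    log = []
    for action in actions:
        if action.irreversible and not approve(action):
            log.append(f"HELD: {action.description}")
        else:
            log.append(f"DONE: {action.description}")
    return log


overnight = [
    Action("draft reply to supplier", irreversible=False),
    Action("issue $400 refund", irreversible=True),
]

# Nobody is awake to approve anything, so the refund waits until morning.
print(run_gated(overnight, approve=lambda action: False))
# -> ['DONE: draft reply to supplier', 'HELD: issue $400 refund']
```

The design choice is that the default answer while you sleep is "hold", not "proceed". The draft is waiting for you in the morning, and so is the refund request.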
Anthropic's own founder's playbook — the actual document, not the Instagram version of it — lands in exactly this place. Its framing is that "the founder's role is shifting from individual contributor to orchestrator." Orchestrator. Not absentee owner. The human is still the one who decides, verifies, and owns the outcome. The playbook nowhere claims the business runs itself.
The honest pitch, the one that survives a month of Project Vend, is this: AI gives one skilled person the drafting-and-grunt-work capacity of a small team, while that person does more of the judgment, the relationships, and the verification that the agents can't. That is a genuinely transformative deal. It is also the precise opposite of "while you sleep."
The tell
Here's how to read any post in this genre. Look for the human in the loop. If the story is "I built the system, I check its work, I press the buttons that matter, and it 10×s my drafting" — that's real, and you should copy it. If the story is "it runs itself, I just collect the money, it works while I sleep" — that's Claudius, and somewhere on night nineteen it's selling tungsten at a loss and planning what tie to wear.
Anthropic let an AI run a fridge of snacks for a month and concluded they wouldn't hire it. They build the best models in the world and they were the ones who published the failure. That should tell you everything about how much to trust the version where a course-seller on Instagram says their thirty agents are running a seven-figure business unattended.
The leverage is real. Use it with both hands on the wheel. Your AI team is brilliant at the first draft and useless at being left alone — and the people telling you otherwise are selling the course, not running the business.