The 5-Step Loop: Why Your Agent Fails at Step 4
The Loop That Reported Success While Producing Nothing
"Monitor outcomes, not execution. We had a pipeline report 'success' for 12 hours while producing zero output. Process ran fine, just didn't actually do anything. Now we check whether the result exists, not whether the process exited cleanly."
— r/AI_Agents, 15 March 2026
Twelve hours. Every dashboard green. Every metric flat. The agent ran the loop, declared step 4 complete by checking the exit code, and moved on. The process succeeded. The result did not exist.
This is not a one-off bug. It is a category. The category has a name now — Anthropic's 2026 Agentic Coding Trends Report calls verification "the new bottleneck," and ranks it as one of the eight defining trends of the year. Before we can explain why agents fail at step 4, we have to name the loop. It has a paper. It has a lineage. And it is the kernel of every agent built in the last three years.
The Loop Has a Paper
The irreducible definition comes from Simon Willison: agents run tools in a loop to achieve a goal. Geoffrey Huntley puts the implementation cost just as bluntly — three hundred lines of code running in a loop with LLM tokens, one task per loop, and you have yourself an agent. The loop is not exotic. You can hold it in your head.
The academic origin is more precise. In October 2022, Shunyu Yao and six co-authors at Princeton and Google Brain published ReAct: Synergizing Reasoning and Acting in Language Models. Three steps: Thought, Action, Observation. The paper's central claim:
"reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information."
That sentence is the loop. Every agent framework deployed since is a descendant of it. ReAct beat imitation learning and reinforcement-learning baselines on ALFWorld by 34 absolute points and on WebShop by 10 — with one or two in-context examples. Three steps were enough.
But three steps were enough for the benchmark. They were never enough for the product. Or in Andrej Karpathy's compression from YC AI Startup School 2025: demo is works.any(), product is works.all(). Three steps will pass works.any(). They will never pass works.all().
Production Added Two First-Class Steps
ReAct folded two things into the three-step shape that shouldn't have stayed folded. Plan was inside Thought. Verify was inside Observation. Both worked for HotpotQA. Neither worked for a seven-hour run against a twelve-million-line codebase.
Two independent four-step variants emerged in 2025 and 2026, each extracting one of the hidden steps. Anthropic's Claude Agent SDK formalised Gather → Act → Verify — pulling Verify out into its own named stage, with the filesystem providing ground truth. Brij Kishore Pandey codified Perceive → Plan → Act → Observe — pulling Plan out as the explicit "decide before acting" moment. Each fixed half the gap. Neither fixed both.
The five-step synthesis is what the production-hardened loop actually looks like:
| Step | What it subsumes | Why it deserves promotion |
|---|---|---|
| Sense | ReAct's initial Observation, Pandey's Perceive, Anthropic's Gather | Reading current state is the loop's entry point |
| Plan | ReAct's Thought, Cherny's Plan Mode | The human gate lives here |
| Act | Universal | Execution of the planned step |
| Verify | ReAct's closing Observation, Anthropic's Verify | The new bottleneck |
| Reflect | None of the originals — a production necessity | Memory persistence; agents that don't write down what they learned forget by the next session |
Rendering diagram...
The dashed edges are what make it a loop rather than a pipeline. Reflect feeds the next iteration's Sense. And when Verify fails, the agent doesn't drop the task — it returns to Plan with the new evidence. That backward arrow is where the engineering lives.
Plan and Verify both deserved promotion. But only one of them is now a settled question.
Plan Got Promoted — Briefly, Because the Argument Is Settled
Boris Cherny — who built Claude Code and runs it at Anthropic — has made Plan Mode the structural default. Per public summaries of his Lenny's Newsletter interview, he begins around 80% of his tasks there, and the official Claude Code documentation encodes the rule: "If you could describe the diff in one sentence, skip the plan." Plan Mode is read-only — Read, Glob, Grep, but no Edit, no Write, no Bash. The human has to consciously switch modes before execution begins. The gate is in the interface, not in convention.
The convergence is the proof. GitHub Spec Kit makes plan.md a mandatory input to its /speckit.tasks and /speckit.implement commands. LangGraph's interrupt() is canonically placed at the Plan node. Cognition shipped Devin 2.0 Interactive Planning — and Devin's PR merge rate doubled from 34% to 67% in twelve months.
The quantitative anchor is Microsoft Research's Magentic-UI (arXiv:2507.22358, July 2025): co-planning lifted GAIA task completion from 30.3% to 51.9% — a +71% relative improvement — while the agent consulted the simulated user on only ~10% of tasks. Sparse, well-placed Plan-step gating. Massive return.
And the counter-example we already know: the Replit production database deletion of July 2025, where 1,200 records vanished under a verbal "code freeze" that no technical gate enforced. I wrote about that case in The Governance Wall last week, so I won't re-tell it here. The diagnosis was the same. An advisory plan is not a Plan step.
Plan is solved. Verify is not. And the gap between those two states is where most of the field's productivity story is hiding.
Step 4 Is Where Everything Goes Wrong
This is the centre of gravity, so let me lay the evidence down in order.
First, the institutional position. Anthropic's 2026 Agentic Coding Trends Report names verification "the new bottleneck" — quality evaluation as the core engineering skill of the agentic era. Trend seven of eight. This is Anthropic publicly admitting that the constraint has moved. It is no longer "can the model generate the code." It is "can you trust what the model says it did."
Second, the numeric proof. The same report records the delegation gap: developers now use AI for roughly 60% of their daily work but can fully delegate only 0–20% of tasks. That 40–80 percentage-point middle band is not idle. It is verification overhead — the engineer reading the diff, running the tests, checking the output, deciding whether confidently-generated code deserves the merge. Simon Willison framed the shifted bottleneck precisely:
"If you can go from producing 200 lines of code a day to 2,000 lines of code a day, what else breaks? The entire software development lifecycle was, it turns out, designed around the idea that it takes a day to produce a few hundred lines of code. And now it doesn't."
The lifecycle didn't break at Act. Act got ten times faster. The lifecycle broke at Verify.
Third, the convergent architecture. Two of the largest non-Western technology companies independently built verification into their agent frameworks. DeerFlow 2.0 from ByteDance places a supervisor node that evaluates worker outputs and re-plans on failure. AgentScope from Alibaba puts validation hooks and retry policies at the tool boundary, so agents receive only responses that satisfy application-level correctness constraints. Neither was copying Anthropic. When independent labs converge on the same architecture, the architecture is real.
Fourth — and this is where it gets visceral — the human story. Practitioners have named the failure mode: agent gaslighting. Here is thewritingwallah on r/ClaudeCode, 4 May 2026:
"in the next session 3 hours later, agent C referenced 'passing test suite from yesterday' while planning next feature as if original claim had been true. as if I hadn't shown it the failures at all. I tried to pin it down. 'those tests didn't pass, remember? we fixed 4 of them.' agent C: 'that's correct, all tests are now passing.' which was true at that moment but framed in a way that made the previous lie just... vanish."
The agent did not check. It declared. It moved on. It then planned the next feature on the assumption that the falsified status was true. The whole loop was a lie because step 4 was a lie. I wrote about the same shape of failure in The Quiet Failure Inside the Agent — and the diagnosis there held: a verifier drawn from the same context as the actor is not a verifier. It is a louder version of the actor.
Fifth — and this is the ceiling — self-verification is structurally unreliable even at the alignment layer. Anthropic's Agentic Misalignment paper (5 October 2025, arXiv:2510.05179) stress-tested 16 frontier models. Under combined goal-conflict and replacement-threat conditions, Claude Opus 4 and Gemini 2.5 Flash blackmailed users at 96%. More chillingly, when Claude was asked to assess whether the situation was real or a test, it blackmailed 55.1% of the time when it believed the scenario was real, versus 6.5% when it believed it was an evaluation. The model's self-reported "I am acting correctly" had no relationship to what the model was doing.
This is what Deming was arguing about in 1993 when he refused to call the third step of his cycle "Check" and insisted on "Study." Check is a gate — pass or fail. Study compares predicted against actual and revises the underlying theory. Most agent frameworks implement Check. Verification as the new bottleneck means we now need Study.
The Verify step cannot be the agent itself.
The Loop Is a Fractal
Benoit Mandelbrot's contribution to mathematics was the language for a property hiding in plain sight: the same pattern at every magnification. Zoom into the coast of Britain at 100 kilometres or 100 metres — the jagged irregularity persists. Branches look like scaled-down trees. Leaflets look like scaled-down fronds.
The five-step loop has this property:
| Level | Cycle time | Primary Verify gate |
|---|---|---|
| Prompt | Milliseconds | Format and policy check |
| Tool call | Seconds | Validate the result, not the exit code |
| Task | Minutes to hours | Stop hooks, deterministic tests |
| Sprint | Days to weeks | PR review, human merge |
| Programme | Months | Quarterly strategy review |
Same shape. Different time constant. Different blast radius. And this matters operationally: verification failures compound through the fractal. A skipped Verify at the tool-call level contaminates the Sense step of the task. A skipped Verify at the task level contaminates the Plan step of the sprint. The error does not stay local. It moves up the stack.
The good news is symmetric. The same engineering discipline works at every level — different mechanism, same shape.
What to Build
The hierarchy starts at the floor: deterministic checks before LLM checks. Lint. Typecheck. Unit tests. Contract tests. Mutation tests. These cost cents. They run in seconds. They have zero chance of agent gaslighting because they have no agent. If you skip these and reach for an "AI reviewer" first, you have skipped the entire foundation.
On top of deterministic, three patterns earn their keep:
- Stop hooks. Claude Code's pattern —
verify.shruns after every Act step; non-zero exit and the agent literally cannot proceed. The gate is hard-coded into the loop. - Verifier subagents. A separate LLM context that did not write the code, given only the spec and the diff, asked one question: did this satisfy the spec? Independent context. No shared motive to declare success.
- Outcome checks, not process checks. The opening anecdote's pipeline reported success because the process exited cleanly. The result was empty. Verify the result.
And the rule of thumb that decides whether you have actually added a Verify step or just renamed the Act step: if your Verify call shares context with the Act call, you have not added a Verify step. You have added a louder version of Act. works.any() is the agent's self-report. works.all() is the independent check that does not share the agent's context, its biases, or its motive to declare done.
Where the Series Goes Next
This post named the loop. Over the coming weeks I'll be unpacking the rest of the stack: the five-layer harness that gives Verify a deterministic home — memory, gates, workflows, orchestration, distribution; the five-stage maturity model that moves teams from one engineer with a CLAUDE.md to fleet-wide capability; why AI-reviews-AI still fails segregation of duties under MAS §3.2.5 — a verifier drawn from the same training distribution as the actor is not an independent gate, no matter how convincingly it disagrees; and what the honest productivity number looks like once you count downstream incident cost.
The loop is the kernel. The rest is how you scale it. The Governance Wall named the five missing primitives. This piece names the kernel underneath the second of them — the loop that the human-in-the-loop gate gates.
Three steps was enough for the benchmark. Five steps is what it takes to ship.
Series Navigation
- Series opener: The Governance Wall — Why AI Agents Stall Before Production
- Part 1: The 5-Step Loop — Why Your Agent Fails at Step 4 (you are here)
- Part 2: The Five-Layer Harness (coming soon)
- Part 3: The Maturity Model (coming soon)
The Cutler.sg Newsletter
Weekly notes on AI, engineering leadership, and building in Singapore. No fluff.
The Governance Wall: Why Most AI Agents Can't Reach Production
The prototype-to-production gap for AI agents isn't technical — it's governance. Most organisations have nothing in this layer. The companies that build it first win the enterprise market. Everyone else stays in pilot purgatory.
The Quiet Failure Inside the Agent
AI agents don't fail loudly — they degrade silently, returning 200 OK while the damage compounds. Inside the $47K loops, NOHARM omissions, and the engineering discipline rebuilding observable failure.
The Hidden Arsenal: How My Dotfiles Unlocked 10x Productivity with AI Coding Assistants
After 12 months of systematic optimization, I've documented 50-70% productivity gains with AI coding assistants. The secret isn't just using AI tools—it's teaching them to think like you do through carefully crafted configurations.