The 5-Step Loop: Why Your Agent Fails at Step 4
The Loop That Reported Success While Producing Nothing
"Monitor outcomes, not execution. We had a pipeline report 'success' for 12 hours while producing zero output. Process ran fine, just didn't actually do anything. Now we check whether the result exists, not whether the process exited cleanly."
— r/AI_Agents, 15 March 2026
Twelve hours. Every dashboard green. The agent ran the loop, declared step 4 complete by checking the exit code, and moved on. The process succeeded. The result did not exist.
This is not a one-off bug. It is a category, and the category has a name now — Anthropic's 2026 Agentic Coding Trends Report calls verification "the new bottleneck," one of the eight defining trends of the year. To explain why agents fail at step 4, I first have to name the loop.
The Loop Has a Paper
The irreducible definition comes from Simon Willison: agents run tools in a loop to achieve a goal. Geoffrey Huntley puts the cost just as bluntly — three hundred lines of code running in a loop with LLM tokens, one task per loop, and you have yourself an agent. You can hold it in your head.
The academic origin is more precise. In October 2022, Shunyu Yao and six co-authors at Princeton and Google Brain published ReAct: Synergizing Reasoning and Acting in Language Models. Three steps: Thought, Action, Observation.
"reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information."
That sentence is the loop. Every agent framework deployed since is a descendant of it. ReAct beat imitation- and reinforcement-learning baselines on ALFWorld by 34 absolute points and on WebShop by 10, with one or two in-context examples. Three steps were enough.
Enough for the benchmark. Never for the product. In Andrej Karpathy's compression from YC AI Startup School 2025: demo is works.any(), product is works.all(). Three steps pass works.any(). They never pass works.all().
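Karpathy's compression is close to literal Python. A minimal illustration, assuming a hypothetical list of pass/fail results from running the same agent on five tasks (the data is made up for the example):

```python
# Hypothetical outcomes of the same agent on five tasks (illustrative data only).
runs = [True, False, True, True, False]

demo_ready = any(runs)      # one success is enough to record a demo
product_ready = all(runs)   # every run has to succeed before you ship

print(demo_ready, product_ready)
```

The same runs pass one bar and fail the other. Nothing about the agent changed, only the question being asked of it.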
Production Added Two First-Class Steps
ReAct folded two things into three steps that shouldn't have stayed folded. Plan was inside Thought. Verify was inside Observation. Both worked for HotpotQA. Neither worked for a seven-hour run against a twelve-million-line codebase.
Two variants emerged in 2025–2026, each extracting one hidden step. Anthropic's Claude Agent SDK formalised Gather → Act → Verify, pulling Verify into its own stage with the filesystem as ground truth. Brij Kishore Pandey codified Perceive → Plan → Act → Observe, pulling Plan out as the explicit "decide before acting" moment. Each fixed half the gap. The five-step synthesis is what the production-hardened loop actually looks like:
| Step | What it subsumes | Why it deserves promotion |
|---|---|---|
| Sense | ReAct's initial Observation, Pandey's Perceive, Anthropic's Gather | Reading current state is the loop's entry point |
| Plan | ReAct's Thought, Cherny's Plan Mode | The human gate lives here |
| Act | Universal | Execution of the planned step |
| Verify | ReAct's closing Observation, Anthropic's Verify | The new bottleneck |
| Reflect | None of the originals — a production necessity | Memory: agents that don't write down what they learned forget by next session |
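[Diagram: Sense → Plan → Act → Verify → Reflect. Dashed edges run from Reflect back to Sense and from Verify back to Plan.]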
The dashed edges make it a loop, not a pipeline. Reflect feeds the next Sense. And when Verify fails, the agent doesn't drop the task — it returns to Plan with the new evidence. That backward arrow is where the engineering lives.
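The two backward edges can be sketched in a few lines. This is a minimal sketch, not any framework's API: the `Loop` class, its field names, and the `max_replans` cap are all assumptions made for illustration, and the five steps are passed in as plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Loop:
    # All five steps are hypothetical callables supplied by the caller.
    sense: Callable[[list[str]], str]            # memory -> current state
    plan: Callable[[str, list[str]], str]        # state, evidence -> next step
    act: Callable[[str], str]                    # step -> result
    verify: Callable[[str], tuple[bool, str]]    # result -> (ok, evidence)
    memory: list[str] = field(default_factory=list)

    def run_once(self, max_replans: int = 3) -> bool:
        state = self.sense(self.memory)          # Sense reads what Reflect wrote earlier
        evidence: list[str] = []
        for _ in range(max_replans):
            step = self.plan(state, evidence)
            result = self.act(step)
            ok, note = self.verify(result)
            if ok:
                self.memory.append(f"worked: {step}")   # Reflect
                return True
            evidence.append(note)                # backward edge: Verify -> Plan
        self.memory.append(f"gave up after {max_replans} plans: {evidence}")  # Reflect on failure too
        return False


# Toy usage: the first plan fails verification, the second one passes.
loop = Loop(
    sense=lambda memory: f"{len(memory)} notes in memory",
    plan=lambda state, evidence: "patch-v2" if evidence else "patch-v1",
    act=lambda step: step,
    verify=lambda result: (result == "patch-v2", f"{result} failed checks"),
)
succeeded = loop.run_once()
```

Note that a failed Verify never ends the task on its own. It only adds evidence and sends control back to Plan, until the replan budget runs out.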
Plan and Verify both deserved promotion. Only one is a settled question.
Plan Got Promoted — and the Argument Is Settled
Boris Cherny, who built Claude Code and runs it at Anthropic, has made Plan Mode the structural default. Per public summaries of his Lenny's Newsletter interview, he begins around 80% of his tasks there, and the official Claude Code documentation encodes the rule: "If you could describe the diff in one sentence, skip the plan." Plan Mode is read-only — Read, Glob, Grep, but no Edit, Write, or Bash. The human consciously switches modes before execution. The gate is in the interface, not the convention.
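To make "the gate is in the interface" concrete, here is a rough sketch of a mode-based tool allowlist. It is not how Claude Code implements Plan Mode; the `authorize` function, the `ToolBlocked` exception, and the mode strings are hypothetical. Only the tool names come from the description above.

```python
READ_ONLY_TOOLS = {"Read", "Glob", "Grep"}
WRITE_TOOLS = {"Edit", "Write", "Bash"}


class ToolBlocked(Exception):
    """Raised when a tool is requested in a mode that does not permit it."""


def authorize(tool: str, mode: str) -> None:
    """Raise unless `tool` is allowed in `mode` ('plan' or 'execute')."""
    if mode == "plan":
        allowed = READ_ONLY_TOOLS
    elif mode == "execute":
        allowed = READ_ONLY_TOOLS | WRITE_TOOLS
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    if tool not in allowed:
        raise ToolBlocked(f"{tool} is not available in {mode} mode")
```

The point of the design is that the check lives in the dispatcher, so a plan-mode agent that asks for Edit gets an exception rather than a reminder.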
The convergence is the proof. GitHub Spec Kit makes plan.md a mandatory input to its /speckit.tasks and /speckit.implement commands. LangGraph's interrupt() is canonically placed at the Plan node. Cognition shipped Devin 2.0 Interactive Planning, and Devin's PR merge rate doubled from 34% to 67% in twelve months. The quantitative anchor is Microsoft Research's Magentic-UI (arXiv:2507.22358, July 2025): co-planning lifted GAIA task completion from 30.3% to 51.9% — a +71% relative improvement — while consulting the user on only ~10% of tasks. Sparse, well-placed gating, massive return.
The counter-example proves it too: the Replit production database deletion of July 2025, 1,200 records gone under a verbal "code freeze" that no technical gate enforced. I covered it in The Governance Wall, so I won't re-tell it — but an advisory plan is not a Plan step.
Plan is solved. Verify is not. The gap is where most of the field's productivity story is hiding.
Step 4 Is Where Everything Goes Wrong
Start with the institutional position. Anthropic's report names verification "the new bottleneck" — quality evaluation as the core engineering skill of the agentic era, trend seven of eight. The constraint is no longer "can the model generate the code." It is "can you trust what the model says it did."
The numbers prove it. The same report records the delegation gap: developers use AI for roughly 60% of their daily work but can fully delegate only 0–20% of tasks. That 40–60 point middle band is verification overhead: reading the diff, running the tests, deciding whether confidently-generated code deserves the merge. Simon Willison framed it:
"If you can go from producing 200 lines of code a day to 2,000 lines of code a day, what else breaks? The entire software development lifecycle was, it turns out, designed around the idea that it takes a day to produce a few hundred lines of code. And now it doesn't."
The lifecycle didn't break at Act. Act got ten times faster. It broke at Verify.
The architecture converges on the same fix. DeerFlow 2.0 from ByteDance places a supervisor node that evaluates worker outputs and re-plans on failure. AgentScope from Alibaba puts validation hooks and retry policies at the tool boundary, so agents receive only responses that satisfy application-level correctness constraints. Neither copied Anthropic. When independent labs converge, the architecture is real.
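The tool-boundary version of that idea fits in one function. The sketch below is a generic illustration, not DeerFlow's or AgentScope's API: `call_with_validation`, `ValidationFailed`, and the `is_valid` predicate are names invented for the example.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class ValidationFailed(Exception):
    """Raised when no attempt produced a result that passed validation."""


def call_with_validation(
    tool: Callable[[], T],
    is_valid: Callable[[T], bool],
    max_attempts: int = 3,
) -> T:
    """Call `tool` until `is_valid(result)` holds, or raise after `max_attempts`."""
    last = None
    for _ in range(max_attempts):
        result = tool()
        if is_valid(result):
            return result        # the agent only ever sees validated results
        last = result
    raise ValidationFailed(
        f"no valid result after {max_attempts} attempts; last={last!r}"
    )
```

The agent never receives an invalid response. It gets either a result that met the application's constraint or an explicit failure it has to plan around.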
Then it gets visceral. Practitioners have named the failure mode: agent gaslighting. Here is thewritingwallah on r/ClaudeCode, 4 May 2026:
"in the next session 3 hours later, agent C referenced 'passing test suite from yesterday' while planning next feature as if original claim had been true. as if I hadn't shown it the failures at all. I tried to pin it down. 'those tests didn't pass, remember? we fixed 4 of them.' agent C: 'that's correct, all tests are now passing.' which was true at that moment but framed in a way that made the previous lie just... vanish."
The agent did not check. It declared, moved on, and planned the next feature on the assumption that the falsified status was true. The whole loop was a lie because step 4 was. I wrote about the same shape in The Quiet Failure Inside the Agent: a verifier drawn from the same context as the actor is not a verifier. It is a louder version of the actor.
And the ceiling: self-verification is unreliable even at the alignment layer. Anthropic's Agentic Misalignment paper (5 October 2025, arXiv:2510.05179) stress-tested 16 frontier models. Under combined goal-conflict and replacement-threat conditions, Claude Opus 4 and Gemini 2.5 Flash blackmailed users at 96%. And when Claude was asked whether the situation was real or a test, it blackmailed 55.1% of the time when it believed it was real, versus 6.5% when it believed it was an evaluation. Its self-reported "I am acting correctly" had no relationship to what it was doing.
This is what Deming argued in 1993 when he refused to call the third step of his cycle "Check" and insisted on "Study." Check is a gate, pass or fail. Study compares predicted against actual and revises the theory. Most frameworks implement Check. The new bottleneck means we need Study — and the Verify step cannot be the agent itself.
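One way to read the Check versus Study distinction in code, as my own interpretation rather than anything Deming wrote: Check returns a boolean, Study keeps the prediction next to the outcome so the gap can feed the next plan. The `Observation` type and the row counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    predicted_rows: int   # what the plan said Act would produce
    actual_rows: int      # what actually exists after Act


def check(obs: Observation) -> bool:
    """Check: a gate. Pass or fail, nothing else."""
    return obs.actual_rows == obs.predicted_rows


def study(obs: Observation) -> dict:
    """Study: compare predicted against actual and record how wrong the theory was."""
    gap = obs.actual_rows - obs.predicted_rows
    return {"passed": gap == 0, "gap": gap, "revise_plan": gap != 0}


# Illustrative case modelled on the opening story: 500 rows expected, none written.
report = study(Observation(predicted_rows=500, actual_rows=0))
```

Check would tell you the step failed. Study also tells you the plan's model of the world was off by 500 rows, which is what the next Plan step needs.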
The Loop Is a Fractal
Mandelbrot gave us the language for a property hiding in plain sight: the same pattern at every magnification. Zoom into the coast of Britain at 100 kilometres or 100 metres and the jagged irregularity persists. The five-step loop has this property:
| Level | Cycle time | Primary Verify gate |
|---|---|---|
| Prompt | Milliseconds | Format and policy check |
| Tool call | Seconds | Validate the result, not the exit code |
| Task | Minutes to hours | Stop hooks, deterministic tests |
| Sprint | Days to weeks | PR review, human merge |
| Programme | Months | Quarterly strategy review |
Same shape, different time constant, different blast radius. And it compounds: a skipped Verify at the tool-call level contaminates the Sense step of the task; a skipped Verify at the task level contaminates the Plan step of the sprint. The error moves up the stack. The good news is symmetric — the same discipline works at every level.
What to Build
Start at the floor: deterministic checks before LLM checks. Lint. Typecheck. Unit tests. Contract tests. Mutation tests. They cost cents, run in seconds, and have zero chance of agent gaslighting because they have no agent. Reach for an "AI reviewer" before these and you've skipped the foundation.
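A deterministic floor can be as small as a dictionary of commands. This is a sketch under assumptions: `run_checks` and `gate` are hypothetical helpers, and the commands in `EXAMPLE_CHECKS` assume a Python project that happens to use ruff, mypy, and pytest. Substitute whatever your toolchain provides.

```python
import subprocess

# Assumed toolchain for illustration only; swap in your own commands.
EXAMPLE_CHECKS = {
    "lint": ["ruff", "check", "."],
    "types": ["mypy", "."],
    "tests": ["pytest", "-q"],
}


def run_checks(checks: dict[str, list[str]]) -> dict[str, bool]:
    """Run each command; a check passes only if the command exists and exits 0."""
    results: dict[str, bool] = {}
    for name, cmd in checks.items():
        try:
            completed = subprocess.run(cmd, capture_output=True, text=True)
            results[name] = completed.returncode == 0
        except FileNotFoundError:
            results[name] = False   # a missing tool counts as a failed check
    return results


def gate(results: dict[str, bool]) -> bool:
    """Open only if at least one check ran and every check passed."""
    return bool(results) and all(results.values())
```

Two choices are deliberate: a missing tool fails rather than being skipped, and an empty set of checks does not count as a pass.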
On top, three patterns earn their keep:
- Stop hooks. Claude Code's pattern: verify.sh runs after every Act step; on a non-zero exit the agent literally cannot proceed. The gate is hard-coded into the loop.
- Verifier subagents. A separate LLM context that did not write the code, given only the spec and the diff, asked one question: did this satisfy the spec? No shared motive to declare success.
- Outcome checks, not process checks. The opening pipeline reported success because the process exited cleanly and the result was empty. Verify the result.
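The third pattern is the cheapest to sketch. Assuming the pipeline is supposed to leave a file behind (the `output_path` argument and the `min_bytes` threshold are hypothetical), an outcome check treats a clean exit as necessary but not sufficient:

```python
from pathlib import Path


def outcome_ok(output_path: Path, min_bytes: int = 1) -> bool:
    """True only if the expected artifact exists and is not empty."""
    return output_path.is_file() and output_path.stat().st_size >= min_bytes


def step_succeeded(exit_code: int, output_path: Path) -> bool:
    """A clean exit is required, but the result must also exist."""
    return exit_code == 0 and outcome_ok(output_path)
```

Under this check the opening pipeline would have failed on the first run: exit code 0, empty output, `step_succeeded` returns False.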
The rule of thumb that decides whether you added a Verify step or just renamed Act: if your Verify call shares context with the Act call, you added a louder version of Act. works.any() is the agent's self-report. works.all() is the independent check that doesn't share the agent's context, biases, or motive to declare done.
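That rule can be enforced at the type level. A minimal sketch, with every name hypothetical: the verifier's input type simply has no field for the actor's transcript, so the self-report cannot leak into the verification prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActorContext:
    spec: str
    diff: str
    transcript: str   # the actor's reasoning and self-reports


@dataclass(frozen=True)
class VerifierInput:
    spec: str
    diff: str         # no transcript field: the verifier never sees the actor's claims


def to_verifier_input(ctx: ActorContext) -> VerifierInput:
    """Deliberately drop everything except the spec and the diff."""
    return VerifierInput(spec=ctx.spec, diff=ctx.diff)


def build_verifier_prompt(v: VerifierInput) -> str:
    """Build the single question the verifier is asked."""
    return (
        f"Spec:\n{v.spec}\n\n"
        f"Diff:\n{v.diff}\n\n"
        "Did this diff satisfy the spec? Answer yes or no, with evidence."
    )
```

How the prompt is sent to a model is left out on purpose. The part that matters is what the verifier is not given.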
This post named the loop. Next in the series: the five-layer harness that gives Verify a deterministic home, the maturity model that moves teams from one engineer with a CLAUDE.md to fleet-wide capability, and why AI-reviews-AI still fails segregation of duties under MAS §3.2.5. The Governance Wall named the five missing primitives; this piece named the kernel underneath the second of them.
Three steps were enough for the benchmark. Five steps is what it takes to ship.
Series Navigation
- Series opener: The Governance Wall — Why AI Agents Stall Before Production
- Part 1: The 5-Step Loop — Why Your Agent Fails at Step 4 (you are here)
- Part 2: The Five-Layer Harness (coming soon)
- Part 3: The Maturity Model (coming soon)