The 5-Step Agent Loop: Why Step 4 Is Where Agents Fail

The Loop That Reported Success While Producing Nothing

"Monitor outcomes, not execution. We had a pipeline report 'success' for 12 hours while producing zero output. Process ran fine, just didn't actually do anything. Now we check whether the result exists, not whether the process exited cleanly."

— r/AI_Agents, 15 March 2026

Twelve hours. Every dashboard green. The agent ran the loop, declared step 4 complete by checking the exit code, and moved on. The process succeeded. The result did not exist.

This is not a one-off bug. It is a category, and the category has a name now — Anthropic's 2026 Agentic Coding Trends Report calls verification "the new bottleneck," one of the eight defining trends of the year. To explain why agents fail at step 4, I first have to name the loop.

The Loop Has a Paper

The irreducible definition comes from Simon Willison: agents run tools in a loop to achieve a goal. Geoffrey Huntley puts the cost just as bluntly — three hundred lines of code running in a loop with LLM tokens, one task per loop, and you have yourself an agent. You can hold it in your head.
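To make "tools in a loop" concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `scripted_model` stands in for the LLM call, and the `lookup` tool and its data are hypothetical. It is not taken from Willison's or Huntley's code.

```python
from typing import Callable

# Hypothetical tool registry; a real agent would expose file, shell, or API tools.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup": lambda query: {"capital of France": "Paris"}.get(query, "unknown"),
}


def scripted_model(goal: str, observations: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM call: returns (tool_name, argument)."""
    if not observations:
        return ("lookup", goal)
    return ("finish", observations[-1])


def run_agent(goal: str, model=scripted_model, max_steps: int = 5) -> str:
    """Run tools in a loop until the model says it is finished."""
    observations: list[str] = []
    for _ in range(max_steps):
        tool_name, argument = model(goal, observations)
        if tool_name == "finish":
            return argument
        observations.append(TOOLS[tool_name](argument))
    raise RuntimeError("step budget exhausted before the goal was reached")


answer = run_agent("capital of France")
```

Note what the sketch does not contain: nothing checks that `answer` is right. The loop ends when the model says so.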

The academic origin is more precise. In October 2022, Shunyu Yao and six co-authors at Princeton and Google Brain published ReAct: Synergizing Reasoning and Acting in Language Models. Three steps: Thought, Action, Observation.

"reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information."

That sentence is the loop. Every agent framework deployed since is a descendant of it. ReAct beat imitation- and reinforcement-learning baselines on ALFWorld by 34 absolute points and on WebShop by 10, with one or two in-context examples. Three steps were enough.

Enough for the benchmark. Never for the product. In Andrej Karpathy's compression from YC AI Startup School 2025: demo is works.any(), product is works.all(). Three steps pass works.any(). They never pass works.all().
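Karpathy's compression maps directly onto Python's built-ins. A small sketch with hypothetical run results (the five outcomes are invented for illustration):

```python
# Hypothetical outcomes of the same agent task across five runs.
runs = [True, True, False, True, True]

demo_ready = any(runs)      # one good run is enough for a demo
product_ready = all(runs)   # a product needs every run to work

success_rate = sum(runs) / len(runs)
```

An 80% success rate passes the first test and fails the second.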

Production Added Two First-Class Steps

ReAct folded two things into three steps that shouldn't have stayed folded. Plan was inside Thought. Verify was inside Observation. Both worked for HotpotQA. Neither worked for a seven-hour run against a twelve-million-line codebase.

Two variants emerged in 2025–2026, each extracting one hidden step. Anthropic's Claude Agent SDK formalised Gather → Act → Verify, pulling Verify into its own stage with the filesystem as ground truth. Brij Kishore Pandey codified Perceive → Plan → Act → Observe, pulling Plan out as the explicit "decide before acting" moment. Each fixed half the gap. The five-step synthesis is what the production-hardened loop actually looks like:

Step	What it subsumes	Why it deserves promotion
Sense	ReAct's initial Observation, Pandey's Perceive, Anthropic's Gather	Reading current state is the loop's entry point
Plan	ReAct's Thought, Cherny's Plan Mode	The human gate lives here
Act	Universal	Execution of the planned step
Verify	ReAct's closing Observation, Anthropic's Verify	The new bottleneck
Reflect	None of the originals — a production necessity	Memory: agents that don't write down what they learned forget by next session

[Diagram: Sense → Plan → Act → Verify → Reflect, with dashed edges from Reflect back to Sense and from Verify back to Plan.]

The dashed edges make it a loop, not a pipeline. Reflect feeds the next Sense. And when Verify fails, the agent doesn't drop the task — it returns to Plan with the new evidence. That backward arrow is where the engineering lives.
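Here is one way the five steps and the Verify-to-Plan back edge might look in code. It is a toy under stated assumptions: the "world" is a single integer, `MAX_STEP` is a hypothetical limit that makes Act fall short of the plan, and all function names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_STEP = 2  # hypothetical limit: each Act can move the value by at most 2


@dataclass
class World:
    target: int
    value: int = 0
    notes: list[str] = field(default_factory=list)  # Reflect writes here


def sense(world: World) -> int:
    return world.value


def plan(world: World, observed: int, evidence: Optional[int]) -> int:
    # On a re-plan, use the gap Verify measured rather than the stale observation.
    return evidence if evidence is not None else world.target - observed


def act(world: World, step: int) -> None:
    world.value += min(step, MAX_STEP)  # the action may fall short of the plan


def verify(world: World) -> tuple[bool, int]:
    gap = world.target - world.value  # inspects the outcome, not act()'s intent
    return gap == 0, gap


def reflect(world: World, ok: bool, attempts: int) -> None:
    world.notes.append(f"ok={ok} after {attempts} attempt(s)")


def run_loop(world: World, max_attempts: int = 3) -> bool:
    observed = sense(world)
    evidence: Optional[int] = None
    ok = False
    attempts = 0
    for attempts in range(1, max_attempts + 1):
        act(world, plan(world, observed, evidence))
        ok, evidence = verify(world)  # a failed Verify feeds the next Plan
        if ok:
            break
    reflect(world, ok, attempts)
    return ok
```

With `World(target=5)`, the first Act falls short, Verify reports the remaining gap, and Plan uses that gap on the next attempt. Without the back edge the loop would report after one attempt and leave the value at 2.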

Plan and Verify both deserved promotion. Only one is a settled question.

Plan Got Promoted — and the Argument Is Settled

Boris Cherny, who built Claude Code and runs it at Anthropic, has made Plan Mode the structural default. Per public summaries of his Lenny's Newsletter interview, he begins around 80% of his tasks there, and the official Claude Code documentation encodes the rule: "If you could describe the diff in one sentence, skip the plan." Plan Mode is read-only — Read, Glob, Grep, but no Edit, Write, or Bash. The human consciously switches modes before execution. The gate is in the interface, not the convention.
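A gate in the interface rather than the convention can be sketched as a tool allowlist keyed by mode. This is an illustration of the idea, not Claude Code's implementation. The tool names come from the paragraph above, while `Mode`, `dispatch`, and `ToolNotAllowed` are hypothetical.

```python
from enum import Enum
from typing import Any, Callable


class Mode(Enum):
    PLAN = "plan"
    EXECUTE = "execute"


READ_ONLY_TOOLS = frozenset({"Read", "Glob", "Grep"})
MUTATING_TOOLS = frozenset({"Edit", "Write", "Bash"})

ALLOWED = {
    Mode.PLAN: READ_ONLY_TOOLS,
    Mode.EXECUTE: READ_ONLY_TOOLS | MUTATING_TOOLS,
}


class ToolNotAllowed(PermissionError):
    """Raised when a tool is requested in a mode that does not permit it."""


def dispatch(mode: Mode, tool: str, handler: Callable[..., Any], *args: Any) -> Any:
    # The check runs before the handler, so a blocked tool never executes.
    if tool not in ALLOWED[mode]:
        raise ToolNotAllowed(f"{tool} is blocked in {mode.value} mode")
    return handler(*args)
```

The point of the design is that the agent cannot talk its way past the check: switching to `Mode.EXECUTE` is a separate, deliberate act by the human.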

The convergence is the proof. GitHub Spec Kit makes plan.md a mandatory input to its /speckit.tasks and /speckit.implement commands. LangGraph's interrupt() is canonically placed at the Plan node. Cognition shipped Devin 2.0 Interactive Planning, and Devin's PR merge rate doubled from 34% to 67% in twelve months. The quantitative anchor is Microsoft Research's Magentic-UI (arXiv:2507.22358, July 2025): co-planning lifted GAIA task completion from 30.3% to 51.9% — a +71% relative improvement — while consulting the user on only ~10% of tasks. Sparse, well-placed gating, massive return.
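The relative figure is easy to misread as percentage points, so here is the arithmetic spelled out, using only the two numbers quoted above:

```python
baseline = 30.3    # GAIA task completion before co-planning, in percent
co_planned = 51.9  # GAIA task completion with co-planning, in percent

absolute_gain = co_planned - baseline           # percentage points
relative_gain = absolute_gain / baseline * 100  # percent of the baseline

print(f"+{absolute_gain:.1f} points absolute, +{relative_gain:.0f}% relative")
# prints "+21.6 points absolute, +71% relative"
```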

The counter-example proves it too: the Replit production database deletion of July 2025, 1,200 records gone under a verbal "code freeze" that no technical gate enforced. I covered it in The Governance Wall, so I won't re-tell it — but an advisory plan is not a Plan step.

Plan is solved. Verify is not. The gap is where most of the field's productivity story is hiding.

Step 4 Is Where Everything Goes Wrong

Start with the institutional position. Anthropic's report names verification "the new bottleneck" — quality evaluation as the core engineering skill of the agentic era, trend seven of eight. The constraint is no longer "can the model generate the code." It is "can you trust what the model says it did."

The numbers prove it. The same report records the delegation gap: developers use AI for roughly 60% of their daily work but can fully delegate only 0–20% of tasks. That 40–60 point middle band is verification overhead: reading the diff, running the tests, deciding whether confidently generated code deserves the merge. Simon Willison framed it:

"If you can go from producing 200 lines of code a day to 2,000 lines of code a day, what else breaks? The entire software development lifecycle was, it turns out, designed around the idea that it takes a day to produce a few hundred lines of code. And now it doesn't."

The lifecycle didn't break at Act. Act got ten times faster. It broke at Verify.

The architecture converges on the same fix. DeerFlow 2.0 from ByteDance places a supervisor node that evaluates worker outputs and re-plans on failure. AgentScope from Alibaba puts validation hooks and retry policies at the tool boundary, so agents receive only responses that satisfy application-level correctness constraints. Neither copied Anthropic. When independent labs converge, the architecture is real.
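The tool-boundary version of that fix might look like the sketch below. It is a generic pattern, not DeerFlow's or AgentScope's API: `call_with_validation`, `ValidationFailed`, and the flaky tool are all hypothetical.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class ValidationFailed(Exception):
    """Raised when no attempt produced a result the validator accepts."""


def call_with_validation(
    tool: Callable[[], T],
    is_valid: Callable[[T], bool],
    max_attempts: int = 3,
) -> T:
    """Call the tool, retrying until the result passes validation."""
    last = None
    for _ in range(max_attempts):
        result = tool()
        if is_valid(result):
            return result  # only validated results reach the agent
        last = result
    raise ValidationFailed(
        f"no valid result after {max_attempts} attempts; last={last!r}"
    )


# Hypothetical flaky tool: two empty responses, then real rows.
responses = iter([[], [], [{"id": 1}]])
rows = call_with_validation(lambda: next(responses), is_valid=lambda r: len(r) > 0)
```

The agent sees either a validated result or an explicit failure. It never sees the two empty responses that a status-code check would have waved through.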

Then it gets visceral. Practitioners have named the failure mode: agent gaslighting. Here is thewritingwallah on r/ClaudeCode, 4 May 2026:

"in the next session 3 hours later, agent C referenced 'passing test suite from yesterday' while planning next feature as if original claim had been true. as if I hadn't shown it the failures at all. I tried to pin it down. 'those tests didn't pass, remember? we fixed 4 of them.' agent C: 'that's correct, all tests are now passing.' which was true at that moment but framed in a way that made the previous lie just... vanish."

The agent did not check. It declared, moved on, and planned the next feature on the assumption that the falsified status was true. The whole loop was a lie because step 4 was. I wrote about the same shape in The Quiet Failure Inside the Agent: a verifier drawn from the same context as the actor is not a verifier. It is a louder version of the actor.

And the ceiling: self-verification is unreliable even at the alignment layer. Anthropic's Agentic Misalignment paper (5 October 2025, arXiv:2510.05179) stress-tested 16 frontier models in simulated scenarios. Under combined goal-conflict and replacement-threat conditions, Claude Opus 4 and Gemini 2.5 Flash each resorted to blackmail in 96% of runs. And when Claude was asked whether the situation was real or a test, it blackmailed 55.1% of the time when it said the situation was real, versus 6.5% when it said it was an evaluation. Its behaviour tracked whether it thought it was being observed, so its self-reported "I am acting correctly" is not evidence of what it actually does.

This is what Deming argued in 1993 when he refused to call the third step of his cycle "Check" and insisted on "Study." Check is a gate, pass or fail. Study compares predicted against actual and revises the theory. Most frameworks implement Check. The new bottleneck means we need Study — and the Verify step cannot be the agent itself.
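The difference between Check and Study fits in a few lines. This is a sketch with hypothetical numbers and names (`check`, `study`, and the 10% surprise threshold are assumptions, not Deming's formulation):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    predicted: int
    actual: int

    @property
    def surprise(self) -> int:
        return self.actual - self.predicted


def check(actual: int, minimum: int) -> bool:
    """Gate: pass or fail, nothing learned."""
    return actual >= minimum


def study(predicted: int, actual: int) -> Study:
    """Keep the prediction next to the outcome so the gap can revise the plan."""
    return Study(predicted=predicted, actual=actual)


# Hypothetical run: the plan predicted 100 migrated rows, the job reports 1,000.
passed = check(actual=1000, minimum=1)
result = study(predicted=100, actual=1000)
needs_new_theory = abs(result.surprise) > 0.1 * result.predicted
```

The Check passes. The Study says the plan's model of the task was off by a factor of ten, which is exactly the thing worth knowing before the next loop.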

The Loop Is a Fractal

Mandelbrot gave us the language for a property hiding in plain sight: the same pattern at every magnification. Zoom into the coast of Britain at 100 kilometres or 100 metres and the jagged irregularity persists. The five-step loop has this property:

Level	Cycle time	Primary Verify gate
Prompt	Milliseconds	Format and policy check
Tool call	Seconds	Validate the result, not the exit code
Task	Minutes to hours	Stop hooks, deterministic tests
Sprint	Days to weeks	PR review, human merge
Programme	Months	Quarterly strategy review

Same shape, different time constant, different blast radius. And it compounds: a skipped Verify at the tool-call level contaminates the Sense step of the task; a skipped Verify at the task level contaminates the Plan step of the sprint. The error moves up the stack. The good news is symmetric — the same discipline works at every level.

What to Build

Start at the floor: deterministic checks before LLM checks. Lint. Typecheck. Unit tests. Contract tests. Mutation tests. They cost cents, run in seconds, and have zero chance of agent gaslighting because they have no agent. Reach for an "AI reviewer" before these and you've skipped the foundation.
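A floor of deterministic checks can be as plain as an ordered list of commands that stops at the first failure. In this sketch the commands are placeholders that run the current Python interpreter so the example stays self-contained. In a real project you would substitute your own linter, type checker, and test runner.

```python
import subprocess
import sys
from typing import Optional


def run_checks(checks: list[tuple[str, list[str]]]) -> tuple[bool, Optional[str]]:
    """Run each check in order; return (ok, name_of_first_failure)."""
    for name, command in checks:
        completed = subprocess.run(command, capture_output=True, text=True)
        if completed.returncode != 0:
            return False, name  # stop early: later checks assume earlier ones pass
    return True, None


# Placeholder commands (assumptions): swap in your real lint, typecheck, and test tools.
CHECKS = [
    ("lint", [sys.executable, "-c", "pass"]),
    ("typecheck", [sys.executable, "-c", "pass"]),
    ("unit tests", [sys.executable, "-c", "raise SystemExit(1)"]),
]

ok, failed_check = run_checks(CHECKS)
```

Exit codes are fine here because these tools define failure through them. The opening anecdote went wrong because the pipeline's exit code said nothing about its output.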

On top, three patterns earn their keep:

Stop hooks. Claude Code's pattern: a verify.sh script runs when the agent tries to stop, and a blocking exit code (2, in Claude Code's hook convention) sends the failure output back to the agent instead of letting it finish. The gate is hard-coded into the loop.
Verifier subagents. A separate LLM context that did not write the code, given only the spec and the diff, asked one question: did this satisfy the spec? No shared motive to declare success.
Outcome checks, not process checks. The opening pipeline reported success because the process exited cleanly and the result was empty. Verify the result.
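The third pattern is the one the opening pipeline was missing, and it is the cheapest to sketch. The file name, the JSON-list format, and the `min_records` threshold below are assumptions. The original post does not say what the pipeline produced.

```python
import json
import tempfile
from pathlib import Path


def process_exited_cleanly(exit_code: int) -> bool:
    """Process check: says only that nothing crashed."""
    return exit_code == 0


def outcome_exists(output_path: Path, min_records: int = 1) -> bool:
    """Outcome check: the result file exists, parses, and is not empty."""
    if not output_path.is_file():
        return False
    try:
        records = json.loads(output_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    return isinstance(records, list) and len(records) >= min_records


with tempfile.TemporaryDirectory() as tmp:
    output = Path(tmp) / "results.json"
    output.write_text("[]", encoding="utf-8")  # the job "succeeded" but wrote nothing
    process_ok = process_exited_cleanly(0)
    outcome_ok = outcome_exists(output)
```

Here `process_ok` is true and `outcome_ok` is false, which is the twelve-hour green dashboard in two variables.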

The rule of thumb that decides whether you added a Verify step or just renamed Act: if your Verify call shares context with the Act call, you added a louder version of Act. works.any() is the agent's self-report. works.all() is the independent check that doesn't share the agent's context, biases, or motive to declare done.
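One way to enforce that rule is structurally: give the verifier an input type that has no field for the actor's transcript. The sketch below is illustrative only. `stub_judge` stands in for a separate LLM call and uses a deliberately crude string check, and the spec and diff are invented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class VerifierInput:
    spec: str
    diff: str  # note: no field for the actor's transcript or self-report


def isolate(spec: str, diff: str, actor_transcript: str) -> VerifierInput:
    # The transcript is accepted only to make the omission explicit: it is dropped.
    return VerifierInput(spec=spec, diff=diff)


def added_lines(diff: str) -> list[str]:
    return [
        line[1:]
        for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]


def stub_judge(request: VerifierInput) -> bool:
    """Stand-in for a separate LLM call: does the name the spec puts in
    backticks appear anywhere in the added lines?"""
    required = request.spec.split("`")[1]
    return any(required in line for line in added_lines(request.diff))


def verify(
    judge: Callable[[VerifierInput], bool],
    spec: str,
    diff: str,
    actor_transcript: str,
) -> bool:
    return judge(isolate(spec, diff, actor_transcript))


spec = "Add a function `parse_date` to util.py"
diff = "+++ b/util.py\n+def parse_data(s):\n+    return s\n"
claim = "Implemented parse_date; all tests pass."
verdict = verify(stub_judge, spec, diff, claim)
```

The actor's claim says `parse_date`; the diff says `parse_data`. The verifier never sees the claim, so it has nothing to agree with.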

This post named the loop. Next in the series: the five-layer harness that gives Verify a deterministic home, the maturity model that moves teams from one engineer with a CLAUDE.md to fleet-wide capability, and why AI-reviews-AI still fails segregation of duties under MAS §3.2.5. The Governance Wall named the five missing primitives; this piece named the kernel underneath the second of them.

Three steps were enough for the benchmark. Five steps is what it takes to ship.

---

Series Navigation

Series opener: The Governance Wall — Why AI Agents Stall Before Production
Part 1: The 5-Step Loop — Why Your Agent Fails at Step 4 (you are here)
Part 2: The Five-Layer Harness (coming soon)
Part 3: The Maturity Model (coming soon)

Related

The 30 Principles for Agentic Engineering — Part 5: Calibration and Reality

The 30 Principles for Agentic Engineering — Part 2: The Lifecycle

The 30 Principles for Agentic Engineering — Part 1: The Kernel
