Loop Engineering: The Loop Was Never the Hard Part

I didn't write the first draft of this post. A loop did.

I typed one command, handed it a folder of research, and walked away. While I made dinner, a fleet of agents argued about it — researchers pulled sources, a strategist picked the angle, a writer drafted, a fact-checker tore through the numbers, an editor scored the prose against a rubric. State got written to disk between every stage so the next agent could pick up where the last one died. I came back to a draft, not a blank page.

So when half my feed started telling me this month to "stop prompting and start looping," my reaction was: yes — and you're selling the wrong half. I've been shipping loops for a year, and the lesson I trust didn't come from the clean wins. It came from a receipt. Early versions of Gluon, my own agent orchestrator, ran without a circuit breaker — and a single run once blew through $500 in tokens before I noticed. Another lost two hours in retry hell, Claude convinced the error was its fault when the system just needed a restart. That's the part nobody's selling. That's the part that's actually engineering.

What loop engineering actually is — and who said what

The term has a clean provenance, and most of the videos get it wrong. Boris Cherny, who heads Claude Code at Anthropic, is the catalyst. On stage at the WorkOS event on 2 June 2026 he said: "I don't prompt Claude anymore. I have loops that are running. They're the ones that are prompting Claude and figuring out what to do. My job is to write loops." Five days later Peter Steinberger — the creator of OpenClaw, now at OpenAI — posted the line that actually went viral, to 8.4 million views: "You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents." And the same day, Addy Osmani gave the idea its name: "Loop engineering is replacing yourself as the person who prompts the agent. You design the system that does it instead." His sharpest line is worth tattooing on the wall: the agent forgets, the repo doesn't.

It's a real progression, not an obituary for what came before:

Prompt engineering  ->  Context engineering  ->  Harness engineering  ->  Loop engineering
(steer one output)      (feed the right info)    (stable exec env)        (the system prompts
                                                                           the agent for you)

Prompt engineering didn't die; it moved up a level — to the seed, the spec, the test cases, the definition of done. A precise instruction matters more in a loop, because it now compounds across hundreds of unwatched steps instead of one.

Strip away the marketing and a loop is a plain object: a cron job with a decision-maker in the body. Something wakes the agent, it reads state, acts, observes, checks the result against an objective "done," and either stops or goes around again — persisting what it learned to disk so the next run is smarter. The lineage runs straight back to the ReAct paper (Yao et al., 2022): reason, act, observe, repeat. AutoGPT tried it in 2023, ran in circles, and burned tokens. The 2026 version is the same loop with three things AutoGPT lacked: context resets, a separate verifier, and a hard stop.

The problem starts the moment people confuse writing the loop with doing the work the loop is supposed to do.

The loop is the easy 20%

Here's the whole canonical loop, the one that started this entire discipline — Geoffrey Huntley's Ralph loop:

bash

while :; do cat PROMPT.md | claude-code ; done

That's it. An infinite loop piping a prompt file into a coding agent, fresh context every iteration. Thirty seconds to write, and genuinely powerful — Huntley built an entire programming language out of variations on this idea.

But a while loop with no exit condition isn't autonomy. If you didn't build the part that decides when to stop, you built a very confident token furnace — and almost every video skips straight past that to the dashboard with the big numbers. The hard 80% is the three things the one-liner doesn't have:

A done-check the agent can't fake. Not "the model said DONE." Something external and objective.
A verifier that never grades its own homework. The thing that produced the work cannot be the thing that signs off on it.
A hard ceiling so a runaway can't eat your weekend. My $500 receipt is what the absence of this costs.

And here's the part the hype has exactly backwards. A loop makes the agent produce faster; it does nothing to make the output correct. The binding constraint doesn't relax — it moves, from "how fast can I write the prompt?" to "how fast can I verify the output before it ships?" The agents generate faster than any human can read. The loop didn't remove that bottleneck. It pointed a firehose at it.

The primitives that exist — and the ones the videos made up

It's worth knowing which parts of that machine actually exist in Claude Code today, because the videos are confidently wrong and being wrong here costs you real time. What's real and buildable, all documented at code.claude.com:

The headless CLI is the backbone for any bash-orchestrated loop: claude -p "<prompt>" --output-format stream-json --max-turns N --max-budget-usd 5.00 --allowedTools "Bash" "Edit", with --dangerously-skip-permissions when you mean it. Hooks are native lifecycle events — PreToolUse, PostToolUse, SubagentStop, and the one that matters most for loops, Stop, which fires when Claude tries to finish and can refuse to let it (a hook exiting with code 2 blocks the action). Subagents live in .claude/agents/*.md — markdown files with frontmatter that let you build the Maker/Checker separation as actual, separate agents with their own tools and models. Skills live in .claude/skills/. MCP connectors reach Slack, Linear, Sentry, GitHub. Git worktrees isolate parallel agents. And the Claude Agent SDK (@anthropic-ai/claude-agent-sdk) hands you the entire agent loop as a library call: query() runs it for you — gather context, take action, verify work, repeat.

/loop is real: a bundled Claude Code skill, prompt-based rather than a magic API primitive, overridable with your own .claude/skills/loop/SKILL.md. And here's the part worth correcting — because the tutorials get it backwards, and so did an earlier draft of this post. /goal is real too, and it's the most interesting primitive here: a first-party command (Claude Code v2.1.139+) that keeps Claude working toward a condition you state, and after every turn a separate, smaller model checks whether the condition holds before Claude is allowed to continue. The built-in command for autonomous loops ships with an independent done-check baked in. Cloud routines and scheduled tasks run work independent of any open session — but you can still wire a schedule the boring way: external cron plus claude -p, or background agents via --bg. (I got this wrong the first time because I trusted a pile of explainer videos over the docs. The irony, given everything below about verifying against ground truth instead of the agent's own say-so, is not lost on me.)

Notice what these primitives are for. --max-turns and --max-budget-usd are the hard stop. The Stop hook is where "don't declare done until the checks pass" lives. Subagents are how you get a verifier with no loyalty to the maker. And /goal proves the pattern at the product level: the command built for unattended loops spends its cleverness on a fresh model that decides when to stop, not on making Claude go. The interesting primitives in Claude Code aren't the ones that make the loop go. They're the ones that make it stop for the right reasons.

Here's the smallest loop that actually works.

Worked example 1: agent-until-green, where the shell is the judge

The cleanest loop is the one where "done" is a green checkmark you didn't have to interpret. Tests pass or they don't, so the trick is to let the shell — not the agent — be the one that checks. The spine is three files on disk and a bash loop, because the agent forgets between iterations and the repo doesn't:

bash

#!/usr/bin/env bash
# loop.sh — run an agent against a spec until the tests pass
set -euo pipefail

MAX="${1:-50}"
ITERATION=0

while [[ $ITERATION -lt $MAX ]]; do
  ITERATION=$((ITERATION + 1))
  echo "=== ITERATION $ITERATION ==="

  # Fresh context window every iteration (Ralph pattern).
  # PROMPT.md tells the agent: read IMPLEMENTATION_PLAN.md, do the next
  # task, run the tests, log to PROGRESS.md, commit, exit.
  cat PROMPT.md | claude-code

  # The verification gate. This runs whether or not the agent claims DONE.
  if bun test > /tmp/verify.txt 2>&1; then
    echo "Tests pass. Verified after $ITERATION iteration(s)."
    git add -A && git commit -m "feat: loop complete after $ITERATION iterations"
    exit 0
  fi

  echo "Still red. Continuing."
  tail -20 /tmp/verify.txt
done

echo "Hit the iteration ceiling without going green." >&2
exit 1

Two design decisions carry this example, and both are invisible if you skim it.

First, fresh context every iteration. The agent re-reads IMPLEMENTATION_PLAN.md and PROGRESS.md from disk and starts clean. Huntley's framing: context windows are arrays you're mallocing your prompt into; let them grow across forty iterations and the model rots, repeating dead ends and forgetting the plan. Reset every loop and each pass works against durable files.

Second, the agent's claim of "DONE" is ignored. The loop exits on bun test, not the model's say-so. That single if is why this loop is trustworthy and the bare one-liner is not.

Tests are a luxury — a binary done-check for free. The harder case is when "done" isn't a green checkmark but "is this code actually good?" That's where you need a second agent with no loyalty to the first.

Worked example 2: Maker/Checker — and why it can't be the same agent

When you can't reduce "done" to a test, you reach for the pattern Anthropic documents in Building Effective Agents: the evaluator-optimizer. One call generates the work. A separate call, with its own independent context, evaluates it and feeds back. Loop until it passes or you hit the cap.

The non-negotiable detail — Anthropic states it explicitly — is that the two calls do not share a system prompt. The evaluator never sees the generator's reasoning, so it can't grade on a curve. Collapse them into one call ("now critique your own work") and you've rebuilt the problem you were trying to solve.

python

# Two calls. No shared context. The evaluator has no loyalty to the maker.
gen = client.messages.create(
    model="claude-opus-4-5",
    messages=[{"role": "user", "content": task}],
)
solution = gen.content[0].text

evaluation = client.messages.create(
    model="claude-opus-4-5",
    system="You are a strict evaluator with no loyalty to the proposed solution. "
           "Reply PASS if it fully satisfies the task, or FAIL with specific gaps.",
    messages=[{"role": "user",
               "content": f"Task: {task}\n\nSolution:\n{solution}\n\nEvaluate."}],
)

Two hardening twists worth stealing. Run the checker on a time delay — a couple of hours after the maker — and you catch the errors that only surface once you're no longer primed to wave them through. And the checker can run on a cheaper model: Opus makes, Haiku checks. Better still, verify across model families — Claude builds, GPT verifies — which reduces the in-group bias of two agents trained on the same distribution.

But I run this myself, so let me be honest about it. I run four or five Claude Code agents across my projects, and one reviews the others' diffs. It catches real things; I'd be slower without it. But I've never been tempted to call it a control. A loop that grades its own output is just AI reviewing AI wearing a bash script — and the day an auditor opens a sensitive PR and asks "who reviewed this?", "an agent" is not an answer they accept. Cross-model verification is excellent screening. It is not independence.

I learned the done-check the expensive way, on Gluon. My first instinct was to ask Claude to output "DONE" when finished. It doesn't work — Claude gets creative, "done" because one section completed, or "I've finished the implementation, no tests yet" when the requirement was implementation plus tests. So I built a multi-signal completion detector: four independent signals, a 60% confidence threshold, exiting only when enough align. The whole architecture exists because the agent will happily tell you it's done in the middle of not being done.

Both examples assume you should be running this at all. The bills and the failure reports say — only sometimes.

The honest cost, and the honest failure rate

First, correct the number you've seen, because it anchors every breathless thread and it's misattributed. The viral $1.3 million-per-month token bill — precisely $1,305,088.81 over 30 days, about 603 billion tokens — belongs to Peter Steinberger's OpenClaw project, which OpenAI funds. Not Pieter Levels, as the threads keep claiming; Steinberger has said as much directly. The companion "92,000 PRs" is unverified; his own number is closer to 30,000. It's not the aspiration the videos imply — it's the cautionary exhibit, what a token furnace looks like when someone else is footing the bill.

For the rest of us, the verified economics are mundane and plannable: about $13 per developer per active day, 90th percentile under $30, $150 to $250 per developer per month for teams running agents seriously, $600 to $1,500 for power users on the raw API. Claude Code burns roughly 3 to 4 times more tokens per task than Codex, because it reads more files and plans more before it edits. And Anthropic added rate limits in early 2026 specifically to curb people running Claude Code in the background all day — when the vendor throttles your loops, the loops were costing the vendor money.

The failure side is better documented than the success side, which should tell you something. Microsoft's AI Red Team published a taxonomy of agentic failure modes on 4 June 2026: silent failures, goal hijacking, memory poisoning, human-in-the-loop bypass. The single image that captures the risk is from a documented incident: an agent called a broken tool 400 times in five minutes. Loops don't self-correct toward truth. Left unguarded, they reinforce whatever they're doing — including being wrong, faster.

The adoption numbers are the most honest part of the picture. Anthropic's own 2026 report finds that even in AI-heavy orgs, developers use AI in about 60% of their work but fully delegate only 0 to 20% of tasks to autonomous agents. The famous "90 to 95% of code is AI-generated" stat is mangled in transit: Garry Tan's actual claim was that 25% of one YC cohort, Winter 2025, had 95% of their lines LLM-generated. Twenty-five percent of one batch is not "all top startups." Keep the qualifier.

And the skeptics deserve their seat. As of mid-2026, loops shine on binary, objective tasks and remain slop machines for nuanced, creative work — the consensus of Greg Isenberg, Ras Mic, and others who actually ship. My own honest productivity number sits at something close to 1.3×, not 10×, and it took a few sprints of watching the bug queue catch up with me before I trusted even that. Loops didn't change the multiple. They changed where the time goes — out of typing prompts, and into reading output.

One law, not five

The discourse will hand you five design laws: write an objective done-check, never self-grade, keep state on disk, set hard stops, engineer the seed. They're all correct. But four are downstream of one:

A loop is only as good as its done-check, and the agent that wrote the work can't be the one to sign off on it.

Get that right and the rest follow — you keep state on disk because the verifier needs durable ground truth, you set hard stops because you've accepted the loop can be confidently wrong, you engineer the seed because a vague spec poisons an objective check.

mermaid


Rendering diagram...

The easy 20% is the blue boxes. The hard 80% is the two yellow ones.

So before you reach for a loop at all, run four conditions past the candidate; it only earns its place if all four hold. Is the task recurring? Can "done" be objectively verified? Can you afford the wasted runs? Does the agent have the tools to do and verify the work? Any "no" and you don't have a loop candidate. You have a token furnace waiting for a match.

Which brings me back to the loop that wrote this post — and the one thing it still can't do.

What my own pipeline still can't do

The pipeline that drafted this article gets the hard parts right on purpose. The writer agent never grades itself — a fact-checker and an editor do, against a rubric, with their own independent context. State lives in a body-of-work.md on disk, so the next post starts smarter than this one. There are hard stops. That separation is why it produces something worth editing instead of confident garbage, and why I trust the loop enough to walk away from it.

But there's a loop it can't close. The pipeline can verify this post is internally sound — sources cited, claims supported, prose not slop. It cannot tell whether the post actually landed: whether the argument was right, whether it changed one CTO's mind. I could build a delayed scraper that feeds published engagement back into body-of-work.md, the same trick that closes the fuzzy LinkedIn loops. But even wired up, "was the argument correct?" remains a judgment call. There is no objective done-check for true.

That's the same thing the auditors and the regulators keep asking for: a named human who can say "yes, this is right, ship it." The loop can't be that person. It was never going to be.

So the loop didn't remove me from the work. It moved me to the only seat that was ever load-bearing — the one where someone decides the output is true before it goes out. Stop prompting and start looping, by all means. Just don't mistake the loop for the job. I came back from dinner to a draft, and then I did the only part that was ever hard.

I didn't write the first draft of this post. A loop did.

What loop engineering actually is — and who said what

It's a real progression, not an obituary for what came before:

Prompt engineering  ->  Context engineering  ->  Harness engineering  ->  Loop engineering
(steer one output)      (feed the right info)    (stable exec env)        (the system prompts
                                                                           the agent for you)

The problem starts the moment people confuse writing the loop with doing the work the loop is supposed to do.

The loop is the easy 20%

Here's the whole canonical loop, the one that started this entire discipline — Geoffrey Huntley's Ralph loop:

bash

while :; do cat PROMPT.md | claude-code ; done

A done-check the agent can't fake. Not "the model said DONE." Something external and objective.
A verifier that never grades its own homework. The thing that produced the work cannot be the thing that signs off on it.
A hard ceiling so a runaway can't eat your weekend. My $500 receipt is what the absence of this costs.

The primitives that exist — and the ones the videos made up

Here's the smallest loop that actually works.

Worked example 1: agent-until-green, where the shell is the judge

bash

#!/usr/bin/env bash
# loop.sh — run an agent against a spec until the tests pass
set -euo pipefail

MAX="${1:-50}"
ITERATION=0

while [[ $ITERATION -lt $MAX ]]; do
  ITERATION=$((ITERATION + 1))
  echo "=== ITERATION $ITERATION ==="

  # Fresh context window every iteration (Ralph pattern).
  # PROMPT.md tells the agent: read IMPLEMENTATION_PLAN.md, do the next
  # task, run the tests, log to PROGRESS.md, commit, exit.
  cat PROMPT.md | claude-code

  # The verification gate. This runs whether or not the agent claims DONE.
  if bun test > /tmp/verify.txt 2>&1; then
    echo "Tests pass. Verified after $ITERATION iteration(s)."
    git add -A && git commit -m "feat: loop complete after $ITERATION iterations"
    exit 0
  fi

  echo "Still red. Continuing."
  tail -20 /tmp/verify.txt
done

echo "Hit the iteration ceiling without going green." >&2
exit 1

Two design decisions carry this example, and both are invisible if you skim it.

Second, the agent's claim of "DONE" is ignored. The loop exits on bun test, not the model's say-so. That single if is why this loop is trustworthy and the bare one-liner is not.

Worked example 2: Maker/Checker — and why it can't be the same agent

python

# Two calls. No shared context. The evaluator has no loyalty to the maker.
gen = client.messages.create(
    model="claude-opus-4-5",
    messages=[{"role": "user", "content": task}],
)
solution = gen.content[0].text

evaluation = client.messages.create(
    model="claude-opus-4-5",
    system="You are a strict evaluator with no loyalty to the proposed solution. "
           "Reply PASS if it fully satisfies the task, or FAIL with specific gaps.",
    messages=[{"role": "user",
               "content": f"Task: {task}\n\nSolution:\n{solution}\n\nEvaluate."}],
)

Both examples assume you should be running this at all. The bills and the failure reports say — only sometimes.

The honest cost, and the honest failure rate

One law, not five

A loop is only as good as its done-check, and the agent that wrote the work can't be the one to sign off on it.

mermaid


Rendering diagram...

The easy 20% is the blue boxes. The hard 80% is the two yellow ones.

Which brings me back to the loop that wrote this post — and the one thing it still can't do.

What my own pipeline still can't do

That's the same thing the auditors and the regulators keep asking for: a named human who can say "yes, this is right, ship it." The loop can't be that person. It was never going to be.

Loop Engineering: The Loop Was Never the Hard Part

What loop engineering actually is — and who said what

The loop is the easy 20%

The primitives that exist — and the ones the videos made up

Worked example 1: agent-until-green, where the shell is the judge

Worked example 2: Maker/Checker — and why it can't be the same agent

The honest cost, and the honest failure rate

One law, not five

What my own pipeline still can't do

The Cutler.sg Newsletter

Ralph Loop: Teaching AI Agents to Work Autonomously (Without Burning Your Budget)

When a Bank Open-Sources Its AI, It Ships the Control Layer — Not the Model

The 30 Principles for Agentic Engineering — Part 5: Calibration and Reality

Loop Engineering: The Loop Was Never the Hard Part

What loop engineering actually is — and who said what

The loop is the easy 20%

The primitives that exist — and the ones the videos made up

Worked example 1: agent-until-green, where the shell is the judge

Worked example 2: Maker/Checker — and why it can't be the same agent

The honest cost, and the honest failure rate

One law, not five

What my own pipeline still can't do

The Cutler.sg Newsletter

Ralph Loop: Teaching AI Agents to Work Autonomously (Without Burning Your Budget)

When a Bank Open-Sources Its AI, It Ships the Control Layer — Not the Model

The 30 Principles for Agentic Engineering — Part 5: Calibration and Reality