Loop Engineering: The Loop Was Never the Hard Part
I didn't write the first draft of this post. A loop did.
I typed one command, handed it a folder of research, and walked away. While I made dinner, a fleet of agents argued about it — researchers pulled sources, a strategist picked the angle, a writer drafted, a fact-checker tore through the numbers, an editor scored the prose against a rubric. State got written to disk between every stage so the next agent could pick up where the last one died. I came back to a draft, not a blank page. The thing you're reading is the output of an agentic loop, polished by the human it was built to free up.
So when half my feed started telling me this month to "stop prompting and start looping," I had a specific reaction: yes — and you're selling the wrong half. The loop is real. I've been shipping loops for a year. But the lesson I actually trust didn't come from the clean wins like the one above. It came from a receipt. Early versions of Gluon, my own agent orchestrator, ran without a circuit breaker — and a single run once blew through $500 in tokens before I noticed. Another lost two hours stuck in retry hell — Claude convinced the error was its fault when the system just needed a restart. Picture opening your cost dashboard on a Monday to find an agent spent the weekend arguing with itself. That's the part nobody's selling. That's the part that's actually engineering.
What loop engineering actually is — and who said what
The term has a clean provenance, and most of the videos get it wrong, so let me set it straight.
Boris Cherny, who heads Claude Code at Anthropic, is the catalyst. On stage at the WorkOS event on 2 June 2026 he said: "I don't prompt Claude anymore. I have loops that are running. They're the ones that are prompting Claude and figuring out what to do. My job is to write loops." Five days later Peter Steinberger — the creator of OpenClaw, now at OpenAI — posted the line that actually went viral, to 8.4 million views: "You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents." And the same day, Addy Osmani gave the idea its name and its definition: "Loop engineering is replacing yourself as the person who prompts the agent. You design the system that does it instead." His sharpest line is the one worth tattooing on the wall: the agent forgets, the repo doesn't.
That's the honest version of the term. It's a real progression, not an obituary for what came before:
Prompt engineering (steer one output)
-> Context engineering (feed the right info)
-> Harness engineering (stable exec env)
-> Loop engineering (the system prompts the agent for you)
Prompt engineering didn't die. It moved up a level — to the seed, the spec, the test cases, the definition of done. Discovering the wrench doesn't mean you throw out the screwdriver. The skill of writing a precise instruction matters more in a loop, not less, because that instruction now compounds across hundreds of unwatched steps instead of one.
Strip away the marketing and a loop is a plain object: a cron job with a decision-maker in the body. Something wakes the agent, it reads state, acts, observes the result, checks the result against an objective "done," and either stops or goes around again — persisting what it learned to disk so the next run is smarter. None of it is exotic. The lineage runs straight back to the ReAct paper (Yao et al., 2022): reason, act, observe, repeat. AutoGPT tried it in 2023, ran in circles, and burned tokens. The 2026 version is the same loop with three things bolted on that AutoGPT lacked: context resets, a separate verifier, and a hard stop.
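To make the "plain object" concrete, here is a minimal sketch of that cycle in Python. The names `run_loop`, `act`, and `is_done` are mine, not part of any tool mentioned in this post, and a JSON file stands in for whatever durable state the loop keeps. In a real loop `act` would invoke the agent; here it is just a callable.

```python
import json
from pathlib import Path
from typing import Callable


def load_state(state_path: Path) -> dict:
    """Read durable state from disk; an absent file means a first run."""
    return json.loads(state_path.read_text()) if state_path.exists() else {}


def run_loop(
    state_path: Path,
    act: Callable[[dict], dict],
    is_done: Callable[[dict], bool],
    max_iterations: int = 10,
) -> tuple[str, dict]:
    """Wake, read state, act, observe, check against 'done', then stop or repeat."""
    state = load_state(state_path)
    for iteration in range(1, max_iterations + 1):
        state = load_state(state_path)            # read state fresh on every pass
        observation = act(state)                  # act, then observe the result
        state.update(observation)
        state["iterations"] = iteration
        state_path.write_text(json.dumps(state))  # persist what was learned
        if is_done(state):                        # objective check, not the agent's word
            return "done", state
    return "ceiling", state                       # hard stop reached without "done"
```

The loop has exactly two exits: the check passes, or the ceiling is hit. Everything the next pass needs is in the file, not in memory.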
All of which is real. The problem starts the moment people confuse writing the loop with doing the work the loop is supposed to do.
The loop is the easy 20%
Here's the whole canonical loop. The one that started this entire discipline, Geoffrey Huntley's Ralph loop:
while :; do cat PROMPT.md | claude-code ; done
That's it. An infinite loop piping a prompt file into a coding agent, fresh context every iteration. Thirty seconds to write. You could ship it before your coffee cools. And it is genuinely powerful — Huntley built an entire programming language out of variations on this idea.
But thirty seconds of typing is not the skill. A while loop with no exit condition isn't autonomy. If you didn't build the part that decides when to stop, you didn't build a loop — you built a very confident token furnace, and almost every video skips straight past that to show you the dashboard with the big numbers.
A loop like that is a way to convert your API balance into heat. The hard 80% — the part that's actually engineering — is the three things the one-liner doesn't have:
- A done-check the agent can't fake. Not "the model said DONE." Something external and objective that says done.
- A verifier that never grades its own homework. The thing that produced the work cannot be the thing that signs off on it.
- A hard ceiling so a runaway can't eat your weekend. My $500 receipt is what the absence of this costs.
Loop engineering is real, but the loop itself is the easy part: anyone can write that one-liner in thirty seconds — the actual engineering is the objective done-check and the independent verifier the agent can't fake.
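The third item, the hard ceiling, is the cheapest of the three to build. A minimal sketch, assuming you can get a per-iteration cost figure from somewhere (the `Ceiling` class and its field names are hypothetical, and this is not how Gluon implements its circuit breaker):

```python
from dataclasses import dataclass


@dataclass
class Ceiling:
    """Hard stop on both iteration count and spend."""
    max_iterations: int
    max_usd: float
    iterations: int = 0
    spent_usd: float = 0.0

    def allow_next(self) -> bool:
        """Check before starting an iteration; False means stop now."""
        return self.iterations < self.max_iterations and self.spent_usd < self.max_usd

    def record(self, usd: float) -> None:
        """Call after each iteration with what it cost."""
        self.iterations += 1
        self.spent_usd += usd


ceiling = Ceiling(max_iterations=50, max_usd=5.00)
while ceiling.allow_next():
    ceiling.record(2.00)  # stand-in for one agent pass that cost $2.00
```

Because the check happens before each pass and the cost is only known afterwards, the last pass can overshoot the dollar limit by one iteration's spend. Set the limit with that margin in mind.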
And here's the part the hype has exactly backwards. A loop makes the agent produce faster. It does nothing to make the output more correct. So the binding constraint doesn't relax — it moves and it tightens. It stops being "how fast can I write the prompt?" and becomes "how fast can I verify the output before it ships?" This is the same bottleneck I keep running into in every corner of this work: the agents generate faster than any human can read. The loop didn't remove the bottleneck. It pointed a firehose at it.
The primitives that exist — and the ones the videos made up
If the hard part is the machine around the loop, it's worth knowing which parts of that machine actually exist in Claude Code today. The videos are confidently wrong about this, and being wrong here costs you real time.
What's real and buildable today, all documented at code.claude.com:
- The headless CLI is the backbone for any bash-orchestrated loop: claude -p "<prompt>" --output-format stream-json --max-turns N --max-budget-usd 5.00 --allowedTools "Bash" "Edit", with --dangerously-skip-permissions when you mean it.
- Hooks are native lifecycle events: PreToolUse, PostToolUse, SubagentStop, and the one that matters most for loops, Stop, which fires when Claude tries to finish and can refuse to let it (a hook exiting with code 2 blocks the action).
- Subagents live in .claude/agents/*.md. They are markdown files with frontmatter that let you build the Maker/Checker separation as actual, separate agents with their own tools and models.
- Skills live in .claude/skills/.
- MCP connectors reach Slack, Linear, Sentry, GitHub.
- Git worktrees isolate parallel agents.
- The Claude Agent SDK (@anthropic-ai/claude-agent-sdk) hands you the entire agent loop as a library call: query() runs it for you (gather context, take action, verify work, repeat), so you don't reimplement the tool-use loop by hand.
/loop is real. It's a bundled Claude Code skill, prompt-based rather than a magic API primitive, and you can override it with your own .claude/skills/loop/SKILL.md.
Now the part worth saying plainly, because it's a service to anyone about to build on a tutorial: /goal, /schedule, and "routines" are not official Anthropic features. The videos conflate them from Codex and third-party tools. There is no documented /goal slash command and no first-party "routines" API in Claude Code. If you want a loop to run on a schedule, you do it the boring, real way: external cron plus claude -p, or background agents via the --bg flag.
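For the boring, real way, a small wrapper that cron can call is enough. This is a sketch under assumptions: the flags are the ones quoted above, so check them against `claude --help` on your installed version; the function names are mine; and I default the output format to `json` rather than `stream-json` because a single result object is easier for a script to read.

```python
import subprocess


def build_headless_command(prompt, max_turns, max_budget_usd, allowed_tools, output_format="json"):
    """Assemble the argv for one headless pass, with both hard stops set explicitly."""
    return [
        "claude", "-p", prompt,
        "--output-format", output_format,
        "--max-turns", str(max_turns),
        "--max-budget-usd", f"{max_budget_usd:.2f}",
        "--allowedTools", *allowed_tools,
    ]


def run_headless(command, timeout_seconds=1800):
    """Run one pass; the timeout is a wall-clock backstop on top of the CLI's own limits."""
    return subprocess.run(command, capture_output=True, text=True, timeout=timeout_seconds)


command = build_headless_command("Fix the failing tests", 20, 5, ["Bash", "Edit"])
```

Passing the arguments as a list avoids shell quoting problems in the prompt. A crontab entry then only has to change into the repo and run the script.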
Notice what every one of those primitives is for. --max-turns and --max-budget-usd are the hard stop. The Stop hook is where "don't declare done until the checks actually pass" lives. Subagents are how you get a verifier with no loyalty to the maker. The interesting primitives in Claude Code aren't the ones that make the loop go. They're the ones that make it stop for the right reasons. The tooling agrees with the thesis: the engineering is in the guardrails.
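Here is roughly what a Stop hook enforcing that rule could look like. Treat it as a sketch: it relies on the exit-code-2 behaviour described above, it assumes the hook receives a JSON payload on stdin with a `stop_hook_active` field (check the hooks reference for your version), and `bun test` is borrowed from the worked example that follows. The function names are mine.

```python
import json
import subprocess
import sys


def decide(hook_input: dict, tests_pass: bool) -> tuple[int, str]:
    """Return (exit_code, message). Exit code 2 blocks the stop; 0 allows it."""
    # Assumption: this flag is set when a previous Stop hook already forced a
    # continuation. Honouring it keeps the hook itself from looping forever.
    if hook_input.get("stop_hook_active"):
        return 0, ""
    if tests_pass:
        return 0, ""
    return 2, "Tests are still failing. Fix them before declaring the task done."


def main() -> int:
    hook_input = json.load(sys.stdin)
    tests_pass = subprocess.run(["bun", "test"], capture_output=True).returncode == 0
    exit_code, message = decide(hook_input, tests_pass)
    if message:
        print(message, file=sys.stderr)  # the reason shown when the stop is refused
    return exit_code

# In the real hook file, end with: sys.exit(main())
```

The hook gives the agent one forced retry, not infinite ones. The outer loop's own test gate and the turn and budget limits remain the final judges.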
Primitives are inert until you arrange them into a loop that knows when it's done. Here's the smallest one that actually works.
Worked example 1: agent-until-green, where the shell is the judge
The cleanest loop is the one where "done" is a green checkmark you didn't have to interpret. Tests pass or they don't. So the entire trick is to let the shell — not the agent — be the one that checks.
The spine is three files on disk (PROMPT.md, IMPLEMENTATION_PLAN.md, and PROGRESS.md) and a bash loop. State lives in the files, because the agent forgets between iterations and the repo doesn't:
#!/usr/bin/env bash
# loop.sh: run an agent against a spec until the tests pass
set -euo pipefail

MAX="${1:-50}"   # hard ceiling on iterations (first argument, default 50)
ITERATION=0

while [[ $ITERATION -lt $MAX ]]; do
  ITERATION=$((ITERATION + 1))
  echo "=== ITERATION $ITERATION ==="

  # Fresh context window every iteration (Ralph pattern).
  # PROMPT.md tells the agent: read IMPLEMENTATION_PLAN.md, do the next
  # task, run the tests, log to PROGRESS.md, commit, exit.
  # Under set -e a non-zero agent exit would kill the loop before the gate runs.
  cat PROMPT.md | claude-code || echo "Agent exited non-zero; verifying anyway." >&2

  # The verification gate. This runs whether or not the agent claims DONE.
  if bun test > /tmp/verify.txt 2>&1; then
    echo "Tests pass. Verified after $ITERATION iteration(s)."
    git add -A
    # The agent may already have committed; only commit if something is staged.
    git diff --cached --quiet || git commit -m "feat: loop complete after $ITERATION iterations"
    exit 0
  fi

  echo "Still red. Continuing."
  tail -20 /tmp/verify.txt
done

echo "Hit the iteration ceiling without going green." >&2
exit 1
Two design decisions are carrying this whole example, and both are invisible if you skim it.
The first is the fresh context every iteration. The agent doesn't accumulate the conversation — it re-reads IMPLEMENTATION_PLAN.md and PROGRESS.md from disk and starts clean. Huntley's framing is that context windows are arrays, and you're mallocing your prompt into them; let them grow across forty iterations and the model rots, repeating dead ends and forgetting the plan. Reset every loop and each pass is a clean slate working against durable files.
The second is the line that matters more than all the rest combined: the agent's claim of "DONE" is ignored. The loop exits on bun test, not on the model's say-so. That single if is two of the design laws expressed in shell — objective done-check, and never let the maker grade itself — and it's why this loop is trustworthy and the bare while :; do ... one-liner is not.
Tests are a luxury. They hand you a binary done-check for free. The harder case is when "done" isn't a green checkmark — when it's "is this code actually good?" or "is this argument actually right?" That's where you need a second agent that has no loyalty to the first.
Worked example 2: Maker/Checker — and why it can't be the same agent
When you can't reduce "done" to a test, you reach for the pattern Anthropic documents in Building Effective Agents: the evaluator-optimizer. One call generates the work. A separate call, with its own independent context, evaluates it and feeds back. Loop until it passes or you hit the iteration cap.
The non-negotiable detail — Anthropic states it explicitly in the cookbook — is that the generator and the evaluator are two separate calls that do not share a system prompt. The evaluator never sees the generator's reasoning. It can't grade on a curve because it has no curve to grade on. The moment you collapse them into one call ("now critique your own work"), you've rebuilt the problem you were trying to solve.
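# Two calls. No shared context. The evaluator has no loyalty to the maker.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
task = "Describe the work to be done here."  # placeholder task description

gen = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=4096,  # the Messages API requires an explicit output cap
    messages=[{"role": "user", "content": task}],
)
solution = gen.content[0].text

evaluation = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system="You are a strict evaluator with no loyalty to the proposed solution. "
           "Reply PASS if it fully satisfies the task, or FAIL with specific gaps.",
    messages=[{"role": "user",
               "content": f"Task: {task}\n\nSolution:\n{solution}\n\nEvaluate."}],
)
verdict = evaluation.content[0].text  # the evaluator's PASS or FAIL reply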
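The two calls above are one round. The "loop until it passes or you hit the iteration cap" part could be sketched like this, with the model calls passed in as plain callables so the control flow stays visible. The function name, the `max_rounds` default, and the convention that a verdict starting with PASS means success are my assumptions, chosen to match the evaluator prompt above.

```python
from typing import Callable, Optional


def maker_checker(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str, str], str],
    max_rounds: int = 3,
) -> tuple[Optional[str], int]:
    """Generate, evaluate in a separate call, feed back, repeat up to max_rounds."""
    feedback: Optional[str] = None
    for round_number in range(1, max_rounds + 1):
        solution = generate(task, feedback)         # maker sees the task plus prior feedback
        verdict = evaluate(task, solution).strip()  # checker sees only task and solution
        if verdict.upper().startswith("PASS"):
            return solution, round_number
        feedback = verdict                          # the FAIL text becomes the next round's input
    return None, max_rounds                         # cap reached; hand it to a human
```

The evaluator is never given the maker's reasoning or the earlier feedback, only the task and the candidate solution. Returning `None` at the cap is deliberate: a loop that runs out of rounds should not quietly ship its last attempt.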
There are two hardening twists worth stealing. Run the checker on a time delay — a couple of hours after the maker, not back-to-back — and you catch the class of errors that only surface once the work has settled and you're no longer primed to wave them through. And the checker can run on a cheaper model: Opus makes, Haiku checks. You can also change the prior entirely and verify across model families — Claude builds, GPT verifies — which reduces the in-group bias of two agents trained on the same distribution.
But I have to be honest about what that arrangement is, because I run it myself, and the temptation to oversell it is exactly the trap this whole post is about. I run four or five Claude Code agents across my projects, and one of them reviews the others' diffs. It catches real things. I'd be slower without it. But I've never once been tempted to call that arrangement a control. A loop that grades its own output is just AI reviewing AI wearing a bash script — and the day an auditor opens a sensitive PR and asks "who reviewed this?", "an agent" is not an answer they accept. Cross-model verification is excellent screening. It is not independence. Keep the distinction or you'll fool yourself faster than the loop fools you.
I learned the done-check the expensive way, on Gluon. My first instinct was the obvious one: ask Claude to output "DONE" when finished. It doesn't work. Claude gets creative — outputs "done" because one section completed, or "I've finished the implementation, no tests yet" when the requirement was implementation plus tests. Single signals fail. Every time. So I built a multi-signal completion detector instead: four independent signals, a confidence threshold of 60%, exiting only when enough of them align. The whole architecture exists because the agent will happily tell you it's done in the middle of not being done.
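The shape of a multi-signal detector is simple even if tuning it is not. In this sketch the four signal names are invented placeholders, not Gluon's actual signals, and I weight them equally, which is also an assumption. Only the count of four and the 60% threshold come from the description above.

```python
from dataclasses import dataclass


@dataclass
class CompletionSignals:
    """Four independent signals; all names here are hypothetical examples."""
    tests_pass: bool
    agent_claims_done: bool
    plan_items_all_checked: bool
    working_tree_committed: bool

    def confidence(self) -> float:
        """Fraction of signals that agree the work is complete."""
        signals = [
            self.tests_pass,
            self.agent_claims_done,
            self.plan_items_all_checked,
            self.working_tree_committed,
        ]
        return sum(signals) / len(signals)


def is_complete(signals: CompletionSignals, threshold: float = 0.60) -> bool:
    """Exit only when enough independent signals align."""
    return signals.confidence() >= threshold
```

With four equally weighted signals, a 60% threshold means three of four must agree: two give 0.5, three give 0.75. The agent saying "done" can contribute at most one vote.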
Both of these examples assume something I haven't defended yet: that you should be running this at all. The bills and the failure reports say — only sometimes.
The honest cost, and the honest failure rate
Let me correct the number you've seen, because it's the centrepiece of every breathless thread and it's misattributed.
The viral $1.3 million-per-month AI token bill — precisely $1,305,088.81 over 30 days, about 603 billion tokens — belongs to Peter Steinberger's OpenClaw project, which OpenAI funds as a perk of supporting it. Not Pieter Levels, as the threads keep claiming. Steinberger has said as much directly. The companion figure, "92,000 PRs," is unverified; his own number is closer to 30,000. None of this is the aspiration the videos imply. It is the cautionary exhibit — what a token furnace looks like when someone else is footing the bill.
For the rest of us, the verified economics are mundane and plannable: about $13 per developer per active day on average, 90th percentile under $30 a day, $150 to $250 per developer per month for teams running agents seriously, and $600 to $1,500 a month for power users on the raw API. Worth knowing too: Claude Code burns roughly 3 to 4 times more tokens per task than Codex on similar work, because it reads more files and plans more before it edits. And Anthropic added rate limits in early 2026 specifically to curb people running Claude Code in the background all day. When the vendor throttles your loops, the loops were costing the vendor money.
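These figures are easy to sanity-check with arithmetic. The helpers below are only that, a back-of-envelope sketch with names of my choosing. The number of active days per month is an input you supply, not something the sources state.

```python
def usd_per_million_tokens(total_usd: float, total_tokens: float) -> float:
    """Blended price implied by a bill and a token count."""
    return total_usd / total_tokens * 1_000_000


def monthly_cost(usd_per_active_day: float, active_days: int) -> float:
    """Monthly spend for one developer at a given daily average."""
    return usd_per_active_day * active_days


furnace_rate = usd_per_million_tokens(1_305_088.81, 603e9)
typical_month = monthly_cost(13, 20)
```

With the numbers quoted above, the $1.3 million bill works out to roughly $2.16 per million tokens, and $13 a day over 20 active days is $260 a month. A developer active on 12 to 19 days lands between $156 and $247, inside the quoted $150 to $250 band.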
The failure side is better documented than the success side, and that should tell you something. Microsoft's AI Red Team published a taxonomy of agentic failure modes on 4 June 2026: silent failures where the agent looks like it's working but makes no real progress, goal hijacking, memory poisoning where earlier context contaminates later decisions, and human-in-the-loop bypass. The single image that captures the whole risk is from a documented incident: an agent called a broken tool 400 times in five minutes. That's the thing to internalise about loops. They don't self-correct toward truth. Left unguarded, they reinforce whatever they're doing — including being wrong, faster.
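The 400-calls failure has a cheap countermeasure: notice when the same call keeps failing and stop. A minimal sketch, assuming your harness can see each tool call and whether it failed. The `RepeatBreaker` class and its default of three repeats are illustrative choices, not taken from the Microsoft taxonomy or from any tool named here.

```python
import json


class RepeatBreaker:
    """Trips when the identical tool call fails several times in a row."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._last_key = None
        self._count = 0

    def record_failure(self, tool: str, args: dict) -> bool:
        """Record a failed call; returns True when the loop should be halted."""
        key = (tool, json.dumps(args, sort_keys=True))  # identical call means same tool and args
        if key == self._last_key:
            self._count += 1
        else:
            self._last_key, self._count = key, 1
        return self._count >= self.max_repeats

    def record_success(self) -> None:
        """Any successful call resets the streak."""
        self._last_key, self._count = None, 0
```

A breaker like this would have stopped that run at the third call instead of the four-hundredth. It only catches exact repeats, so an agent that varies its arguments slightly on each failure still needs the turn and budget limits behind it.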
Which is why the adoption numbers are the most honest part of the picture. Anthropic's own 2026 report finds that even in AI-heavy orgs, developers use AI in about 60% of their work but fully delegate only 0 to 20% of tasks to autonomous agents — because a loop without an independent verifier just produces confident output faster, not correct output. The famous "90 to 95% of code is AI-generated" stat is real but mangled in transit: Garry Tan's actual claim was that 25% of one YC cohort, Winter 2025, had 95% of their lines LLM-generated. Twenty-five percent of one batch is not "all top startups." Keep the qualifier.
And the skeptics deserve their seat. As of mid-2026, loops genuinely shine on binary, objective tasks and remain slop machines for nuanced, creative work — the consensus of Greg Isenberg, Ras Mic, and others who actually ship. My own honest productivity number sits at something close to 1.3×, not 10×, and it took a few sprints of watching the bug queue catch up with me before I trusted even that. Loops didn't change the multiple. They changed where the time goes — out of typing prompts, and into reading output.
So if loops are worth it only where "done" is objective and a separate thing verifies it — that's not five laws to memorise. It's one law with consequences.
One law, not five
The discourse will hand you five design laws: write an objective done-check, never self-grade, keep state on disk, set hard stop conditions, engineer the seed. They're all correct. But four of them are downstream of one.
A loop is only as good as its done-check, and the agent that wrote the work can't be the one to sign off on it.
Get those two right and the rest follow — you'll keep state on disk because the verifier needs durable ground truth to check against, you'll set hard stops because you've accepted the loop can be confidently wrong, you'll engineer the seed because a vague spec poisons an objective check. Get those two wrong and no amount of automation, scheduling, or clever orchestration saves you. You've just built a faster way to be confidently wrong.
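[Diagram not rendered: blue boxes mark the loop itself; two yellow boxes mark the done-check gate and the hard ceiling.]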
Everything in the easy 20% is the blue boxes. The entire hard 80% is the two yellow ones — the gate the agent can't fake, and the ceiling that catches it when the gate isn't enough.
The practical filter, then, isn't a checklist to action — it's a question to ask before you reach for a loop at all. I run four conditions past every candidate, and the loop only earns its place if all four are true. Is the task recurring? Can "done" be objectively verified? Can you afford the wasted runs? Does the agent have the tools to do and verify the work? Any "no" and you don't have a loop candidate. You have a token furnace waiting for a match.
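The filter is small enough to write down as code, which is a useful way to force an honest answer to each question. The class and field names below are my own shorthand for the four conditions.

```python
from dataclasses import dataclass, fields


@dataclass
class LoopCandidate:
    """The four conditions; every one must hold before a loop earns its place."""
    recurring: bool
    done_is_objectively_verifiable: bool
    wasted_runs_affordable: bool
    agent_has_tools_to_do_and_verify: bool

    def failed_conditions(self) -> list[str]:
        """Names of the conditions that are not met."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def qualifies(self) -> bool:
        """Any single 'no' disqualifies the task."""
        return not self.failed_conditions()
```

There is no scoring and no weighting on purpose. A task that is recurring, affordable, and well tooled but has no objective "done" is still a token furnace.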
Which brings me back to the loop that wrote this post — and the one thing it still can't do.
What my own pipeline still can't do
The pipeline that drafted this article is a loop-engineering artifact, and it gets the hard parts right on purpose. The writer agent never grades itself — a fact-checker and an editor do, against a rubric, with their own independent context. State lives in a body-of-work.md on disk, so the next post I write starts smarter than this one instead of from zero. There are hard stops. That separation is exactly why it produces something worth editing instead of confident garbage, and it's why I trust the loop enough to walk away from it.
But there's a loop it can't close, and the gap is instructive. The pipeline can verify that this post is internally sound — sources cited, claims supported, prose not slop. It cannot tell whether the post actually landed. Whether the argument was right, whether it changed one CTO's mind, whether anyone past the headline agreed. I could build a delayed scraper that reads the published analytics and feeds engagement back into body-of-work.md — the same delayed-metrics trick that closes the fuzzy LinkedIn loops. But even with that wired up, "was the argument correct?" remains a judgment call. There is no objective done-check for true.
That's the same thing the auditors keep asking for, and the regulators, and anyone who has to stand behind the output: a named human who can say "yes, this is right, ship it." The loop can't be that person. It was never going to be.
So the loop didn't remove me from the work. It moved me to the only seat that was ever load-bearing — the one where someone decides the output is true before it goes out. Stop prompting and start looping, by all means. Just don't mistake the loop for the job. I came back from dinner to a draft, and then I did the only part that was ever hard.