Ralph Loop: Teaching AI Agents to Work Autonomously (Without Burning Your Budget)
Part 3 of 4 in Gluon: Building an AI Agent Orchestrator series
You've built the workflow. You've designed the task. You hit "resume," walk away for five minutes, come back to another question, and hit resume again. Repeat that loop forty times. At scale, this becomes a human bottleneck — exactly what autonomous agents are supposed to solve.
Ralph Loop solves this. It's the system that lets Claude work continuously, making its own judgments about progress and completion, until the work is actually done. Named after Frank Bria's original bash script, Ralph represents Gluon's philosophy of autonomy with safety rails, not autonomy without them.
The core insight is simple but non-obvious: autonomous execution requires three interlocking systems working together. Remove one, and the whole thing fails — either it stops when it shouldn't, costs money when it shouldn't, or keeps running when it definitely shouldn't.
The Autonomy Problem
Let's be honest about what we're trying to solve. In interactive mode, you're present. Claude makes a suggestion, you evaluate it, you approve or redirect. In autonomous mode, Claude works unattended until the job is done.
The problem is straightforward: how does Claude know when the job is actually done?
And behind that: how much does this autonomy cost?
The research is brutal: the vast majority of AI pilots reportedly fail in production deployments. The most common reason isn't that the AI was bad — it's that human-in-the-loop becomes the weak link. Teams get flooded with thousands of daily approvals. Alert fatigue sets in. They switch to "auto-approve" modes. Unintended autonomy escalation. Chaos.
The insight that changed everything: human-in-the-loop isn't the answer. It's the constraint.
"We're cooperating with AI, they generate and humans verify. It is in our interest to make this loop go as fast as possible, and we have to keep the AI on a leash." — Andrej Karpathy
The leash has to be smart. It has to know when to hold on, when to let go, and when to stop the agent cold before something expensive happens.
The Circuit Breaker Deep Dive
I borrowed the metaphor from electrical engineering. In your house, a circuit breaker is a three-state device: it sits in the CLOSED position under normal conditions, letting current flow freely. If something goes wrong — an overload, a short circuit — it flips to HALF_OPEN as a warning, checking whether conditions have stabilized. If they haven't, it locks into the OPEN position, cutting the circuit entirely until someone resets it manually.
Ralph Loop uses the same pattern.
CLOSED is the normal state. Claude is working, iterating, making progress. The circuit breaker watches for signs of trouble: files changing, errors being resolved, tokens being spent productively. As long as something is happening, stay closed.
HALF_OPEN triggers when the circuit breaker detects a stall: either Claude hasn't modified any files for 5 consecutive iterations, or it's caught in a repeated error loop. Now the system watches carefully. Is Claude recovering? Making a fresh attempt? If yes, reset to CLOSED. If not, grant a patience window—three more loops to prove progress before tripping OPEN.
OPEN means manual intervention required. The loop stops. The run moves to REVIEW status. A human needs to look at what happened and decide whether to resume, modify the prompt, or start over.
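The three states can be sketched as a small state machine. This is an illustrative sketch, not Gluon's actual implementation: the thresholds (5 stalled iterations, a 3-loop patience window) come from the description above, but the class and method names are my own.

```python
from enum import Enum

class State(Enum):
    CLOSED = "closed"        # normal operation, work is progressing
    HALF_OPEN = "half_open"  # stall detected, watching for recovery
    OPEN = "open"            # tripped; manual intervention required

class CircuitBreaker:
    STALL_LIMIT = 5  # iterations with no file changes before HALF_OPEN
    PATIENCE = 3     # extra loops allowed in HALF_OPEN before OPEN

    def __init__(self):
        self.state = State.CLOSED
        self.stalled_iterations = 0
        self.half_open_loops = 0

    def record_iteration(self, files_modified: int) -> State:
        if files_modified > 0:
            # Any progress resets the breaker back to normal operation.
            self.stalled_iterations = 0
            self.half_open_loops = 0
            self.state = State.CLOSED
            return self.state
        self.stalled_iterations += 1
        if self.state == State.CLOSED and self.stalled_iterations >= self.STALL_LIMIT:
            self.state = State.HALF_OPEN
        elif self.state == State.HALF_OPEN:
            self.half_open_loops += 1
            if self.half_open_loops >= self.PATIENCE:
                self.state = State.OPEN
        return self.state
```

Once tripped to OPEN, nothing short of progress (or a human reset) moves it back: further stalled iterations leave the breaker where it is.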
Here's why it matters: this prevents runaway loops. Early Gluon iterations without a circuit breaker saw a single run blow through $500 in tokens. Another lost two hours stuck in retry hell—Claude convinced the error was its fault when the system just needed a restart.
The circuit breaker stops that cold. It's the valve that turns autonomy from risky to trustworthy.
Completion Detection: The Hard Problem
This is where it gets tricky. The circuit breaker stops the loop from running forever. But how does Claude know when to stop the loop voluntarily — before the circuit breaker has to step in?
I could ask Claude to output "DONE" when finished. That doesn't work. Claude gets creative: it outputs "done" midway through because one section is complete, or reports "I've finished the implementation, no tests written yet" when the requirement was implementation plus tests.
Single signals fail.
So I built a multi-signal completion detector: four independent signals, each scoring confidence. The loop exits only when enough signals align—or when the strongest signal fires with high confidence.
Signal 1: RALPH_STATUS Block (+50 confidence if EXIT_SIGNAL=true)
At the end of each iteration, Claude can output a structured status block:
---RALPH_STATUS---
STATUS: IN_PROGRESS | COMPLETE | BLOCKED
TASKS_COMPLETED_THIS_LOOP: 3
FILES_MODIFIED: 2
TESTS_STATUS: PASSING
WORK_TYPE: IMPLEMENTATION
EXIT_SIGNAL: false
---END_RALPH_STATUS---
If Claude explicitly sets EXIT_SIGNAL to true, the system assigns a 50-point confidence boost. This is Claude's explicit contract: "I have determined the task is complete."
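A minimal parser for this contract might look like the following. The block delimiters and field names come from the example above; the function names and parsing details are illustrative, not Gluon's actual code.

```python
import re

def parse_ralph_status(output: str) -> dict:
    """Extract the RALPH_STATUS block from an iteration's output, if present."""
    match = re.search(
        r"---RALPH_STATUS---(.*?)---END_RALPH_STATUS---", output, re.DOTALL
    )
    if not match:
        return {}
    fields = {}
    for line in match.group(1).strip().splitlines():
        key, _, value = line.partition(":")
        if value:
            fields[key.strip()] = value.strip()
    return fields

def status_confidence(fields: dict) -> int:
    # EXIT_SIGNAL=true is Claude's explicit completion contract: +50 points.
    return 50 if fields.get("EXIT_SIGNAL", "").lower() == "true" else 0
```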
Signal 2: Keyword Detection (+10–15 confidence)
Words like "done," "complete," "finished," or "all tasks complete" add confidence. Weak individually—Claude might use "done" for a completed step—but they stack to form a stronger signal.
Signal 3: TODO File Parsing (+40 confidence)
At startup, Gluon scans for task definition files—TODO.md, @fix_plan.md—and parses the checkboxes. 100% completion gets +40 confidence. This is strong because it's explicit, auditable, and aligned with how humans track work.
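Checkbox parsing is easy to sketch. Assuming GitHub-style markdown checkboxes, a hypothetical scorer could look like this (names and regex are mine, not Gluon's):

```python
import re

def todo_completion(todo_markdown: str) -> float:
    """Fraction of markdown checkboxes that are checked, or 0.0 if none exist."""
    boxes = re.findall(r"^\s*[-*]\s*\[([ xX])\]", todo_markdown, re.MULTILINE)
    if not boxes:
        return 0.0
    return sum(1 for b in boxes if b.lower() == "x") / len(boxes)

def todo_confidence(todo_markdown: str) -> int:
    # Only 100% completion earns the full +40; partial progress scores nothing here.
    return 40 if todo_completion(todo_markdown) == 1.0 else 0
```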
Signal 4: Test Saturation (automatic exit after 3 consecutive test-only loops)
If Claude runs three iterations in a row without writing any new code, just running tests, that's a sign the work is done and we're just verifying. Exit automatically.
The system adds up these confidence scores. The threshold is 60% confidence, or when multiple independent signals fire consecutively. In practice, this exits far more often on actual task completion than on safety intervention.
Why not wait for higher confidence? Because multiple weak signals combined are more reliable than waiting for a single strong signal. Claude might spend four iterations on a task that's actually complete, just because it doesn't output an explicit EXIT_SIGNAL. By using multi-signal detection, we get faster, more reliable exits with fewer false positives.
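Putting the signals together, the aggregation itself is just a sum against a threshold. A sketch using the point values described above, with illustrative names:

```python
THRESHOLD = 60  # exit when combined confidence reaches this

def should_exit(signals: dict[str, int]) -> bool:
    """Combine independent completion signals into one exit decision.

    `signals` maps signal name to confidence points, e.g.
    {"ralph_status": 50, "keywords": 15, "todo": 40, "test_saturation": 0}.
    No single weak signal can trip the exit alone, but two or three
    aligned signals clear the bar easily.
    """
    return sum(signals.values()) >= THRESHOLD
```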


Cost Controls & Rate Limiting
The rate limiter is the guardrail that keeps costs predictable.
Ralph Loop defaults to 100 API calls/hour—the circuit breaker on consumption. Bump it to 150 for testing, 200 for aggressive production autonomy. You set the limit; the system enforces it.
You can also set a cost cap. --max-cost 5.00 means the run will not spend more than $5 on tokens. When it hits the limit, it exits to REVIEW status — the same state as a circuit break. You can review what happened and resume manually if you want.
Token costs dominate: 70–90% of Gluon's variable spend. Model selection compounds this dramatically: Haiku runs at roughly 1/5 the cost of Sonnet, and Sonnet at roughly 1/5 the cost of Opus. Run a Formula Workflow on Opus throughout and you pay 5x what Sonnet-everywhere would cost for identical results; mix Haiku for fast tasks with Opus for planning and you save around 80%. This is where Ralph Loop's cost efficiency shines.
Here's what the cost profile looks like, assuming roughly 3,000 sessions per month at scale:
- Simple code generation: $0.05–$0.50 per session, or $150–$1,500/month
- Autonomous with Ralph Loop: $1–$5 per session, or $3,000–$15,000/month
- Complex multi-step workflows: $5–$15 per session, or $15,000–$45,000/month
The rate limiter doesn't slow things down. It prevents surprise bills.
The Supervision Daemon
Rate limiting handles peak protection. But what about interrupted runs, or phases that complete and need auto-resumption? The supervision daemon handles that.
Picture a train conductor checking the board every 30 seconds for trains waiting to depart. Same questions each time: Safe? Track available? Signal green? Dispatch or wait.
Gluon's supervision daemon works the same way. It polls every 30 seconds for runs in REVIEW status. For each one, it checks a series of safety conditions:
- Is the circuit breaker in CLOSED state? (Loop is making progress, not stuck)
- Has the cost cap been reached? (We're not overspending)
- Have we hit the hourly rate limit? (We have API capacity)
- Has at least 60 seconds passed since the last resume attempt? (Avoid rapid restart loops)
- Have we auto-resumed this task fewer than 5 times? (Fallback to manual if patterns repeat)
If all checks pass, the daemon auto-resumes the task based on the configured policy:
AGGRESSIVE: Resume with minimal checks. Get this done fast. The system trusts that the safety conditions are sufficient.
CONSERVATIVE: The default. Strict safety checks, longer wait between resumes. More cautious.
MANUAL: Never auto-resume. I want to decide when this resumes. The run stays in REVIEW indefinitely until a human says "go."
The supervision daemon maintains a full audit trail of every resume decision: which checks passed, which policy was applied, when the resume happened, what the result was. This transparency is crucial for production deployments where you need to show stakeholders exactly how an autonomous system is making decisions.
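The five safety checks reduce to a single predicate. A sketch with hypothetical field names (the check logic comes from the list above; the data shape is my own):

```python
from dataclasses import dataclass

@dataclass
class RunState:
    # Hypothetical snapshot of a run sitting in REVIEW status.
    circuit_closed: bool    # circuit breaker state
    cost_spent: float       # dollars spent so far
    cost_cap: float         # --max-cost limit
    calls_this_hour: int    # API calls in the current window
    rate_limit: int         # hourly call budget
    last_resume_at: float   # epoch seconds of the last resume attempt
    resume_count: int       # auto-resumes so far

def can_auto_resume(run: RunState, now: float) -> bool:
    """Apply the supervision daemon's five safety checks before auto-resuming."""
    return (
        run.circuit_closed                        # breaker is CLOSED
        and run.cost_spent < run.cost_cap         # cost cap not reached
        and run.calls_this_hour < run.rate_limit  # API capacity remains
        and now - run.last_resume_at >= 60        # 60s since last resume
        and run.resume_count < 5                  # fewer than 5 auto-resumes
    )
```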
Formula Workflows: Declarative Multi-Step Autonomy
So far, we've talked about a single autonomous task: run Ralph Loop until the job is done. But what if you want to express complex workflows — like Plan, Implement, Test, Review — as a coherent pipeline?
That's where Formula Workflows come in.
A Formula is a YAML-defined multi-step pipeline. Each step is its own Claude iteration, but they execute as a single unified run sharing a git worktree. The first step creates the run and worktree. Subsequent steps resume with a fresh Claude session — new context, clean slate — but they're all working on the same codebase.
Here's what that looks like:
name: feature
steps:
- id: plan
model: opus
prompt: "Plan the feature implementation..."
- id: implement
model: sonnet
prompt: "Implement based on the plan..."
- id: test
model: haiku
prompt: "Write and run tests..."
- id: review
model: opus
prompt: "Review the implementation..."
Notice the model selection: Planning (complex) → Opus. Implementation (straightforward after planning) → Sonnet. Testing (fast, low-risk) → Haiku. Review (high-stakes) → Opus.
This right-sizing cuts costs by about 80% versus running Opus everywhere, with no loss in output quality. The dashboard shows one card with progress: "Step 2/4 Implement."

After each step, Blueprint Validation kicks in:
- Auto-fix: Run the linter's auto-fix (ruff check --fix && ruff format .). Most style issues self-resolve.
- Lint loop: If lint errors remain, resume Claude to fix them (up to 3 retries). The "humans write, AI fixes style" pattern.
- Test gate: Run the suite. If tests fail, resume Claude for fixes. You don't advance until tests pass.
Formula Workflows become self-healing. You express the shape in YAML. Gluon handles iteration, retry, validation.
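The auto-fix, lint-loop, and test-gate sequence can be sketched as follows. The ruff and pytest commands come from the text; the run and resume_claude callables are stand-ins for Gluon's actual orchestration, injected here so the flow is testable.

```python
def blueprint_validate(run, resume_claude, max_retries: int = 3) -> bool:
    """Post-step validation: auto-fix style, lint-loop with Claude, then test gate.

    `run(cmd)` executes a command and returns True on exit code 0, e.g.
    lambda cmd: subprocess.run(cmd).returncode == 0.
    `resume_claude(prompt)` resumes the agent with a fix-it prompt.
    """
    # 1. Auto-fix: most style issues self-resolve.
    run(["ruff", "check", "--fix", "."])
    run(["ruff", "format", "."])

    # 2. Lint loop: remaining errors go back to Claude, up to 3 retries.
    for _ in range(max_retries):
        if run(["ruff", "check", "."]):
            break
        resume_claude("Fix the remaining lint errors.")
    else:
        return False  # lint never came clean; escalate to a human

    # 3. Test gate: the step does not advance until the suite passes.
    for _ in range(max_retries):
        if run(["pytest", "-q"]):
            return True
        resume_claude("Tests are failing; fix them.")
    return False
```

Bounding both loops matters: an unbounded "retry until green" is exactly the runaway pattern the circuit breaker exists to prevent.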
Safety as a Feature, Not a Limitation
Here's the reframe: these controls don't slow things down. They make scale possible.
No circuit breaker? Runaway loops drain budgets overnight. No completion detection? Claude burns tokens long after finishing. No rate limiting? Monthly bills become gambling. No supervision? Tasks rot in REVIEW indefinitely.
These systems form the trust foundation. Implement them right, and you earn autonomy. You delegate confidently because guardrails catch disasters before they happen. The leash isn't a constraint—it's how you scale trust. How you go from "autonomous agents are risky" to "autonomous agents are the default way we work."
Gluon embodies this philosophy. Every autonomy feature has a counterpart safety system—Ralph needs the circuit breaker, Formulas need Blueprint Validation, supervision is non-negotiable.
The result: scalable autonomous execution. From $2 prototype tasks to $50 enterprise workflows—all completing reliably, staying within budget, and running unattended.
Series Navigation
- Post 1: From tmux Chaos to AI Agent Orchestration
- Post 2: Inside the Cockpit
- Post 3: Ralph Loop — Autonomous Execution (you are here)
- Post 4: From Solo Tool to Team Infrastructure