The 15-Tool-Call Rule: Where Agent Quality Falls Off a Cliff
$ grep -n "^##" 2026-05-fifteen-tool-call-rule.md>
I run four or five Claude Code agents across my projects — Gluon, scraper-mcp, dagentic — most days. I didn't go looking for a threshold; I felt one. Somewhere in the second half of a long agent run, the output stops being sharp. The agent starts repeating work it already did, re-reading files it already has, confidently editing the wrong thing. The first few times I blamed the model. Then I started counting tool calls, and the pattern was embarrassingly consistent: it was almost always somewhere past the fifteenth call in a single prompt that the quality came off.
Not gradually. A cliff. And it turns out a lot of other people have walked off the same one.
The clearest field statement I've seen is from a Reddit thread last January:
"Around 15 tool calls is ok, everything more, LLM performance degrades."
— practitioner on r/AI_Agents, 27 January 2026
It's not vendor doctrine. But it matches what Anthropic's own production engineers concluded when they instrumented the multi-agent Research system — and built explicit caps into the orchestrator: roughly 3–10 tool calls for simple fact-finding subagents, 10–15 for comparison and synthesis subagents. The number isn't magic. The constraint is real.
Why it happens
Three mechanisms compound across a long tool sequence:
Context accumulation. Every tool call returns content. Most of it isn't relevant to the next step. The agent's context fills with stale Read output, half-relevant Grep matches, error messages it already handled. By call 12, the signal-to-noise ratio is degraded enough that the model starts substituting plausible-sounding action for grounded action.
Attention drift. Long contexts dilute the original instructions. The "Don't touch auth/ in this task" qualifier from the prompt is now twelve thousand tokens behind, competing with the most recent test output for the model's attention.
Failure compounding. A wrong turn at call 5 produces an error at call 7, which the agent tries to fix at call 8, which puts it in territory it shouldn't be in by call 11. Anthropic's engineers were blunt about this in the same post: "Agents are stateful and errors compound. Minor system failures can be catastrophic for agents."
The cliff isn't a hard threshold from a paper. It's the cumulative effect of these three mechanisms hitting at roughly the same time. Fifteen is the point where most people notice it.
Three operational rules
1. Instrument the tool-call count. You cannot manage what you cannot see. Most agent harnesses now emit OpenTelemetry spans per tool call — if yours doesn't, the Claude Code OTEL integration is the cheapest fix. Build a simple dashboard: tool calls per session, distribution by tool, alert when any single turn exceeds 15. When I first put a counter on my own sessions, the uncomfortable discovery was that a routine bug-fixing prompt was already past fifteen calls by the second iteration — I'd been operating on the wrong side of the cliff for months and reading the resulting mediocrity as the model's ceiling rather than my own setup.
2. Run /compact at 50% context, not at overflow. The default impulse is to let context fill until the harness forces a compaction. The folk rule — repeated across multiple practitioner blogs and threads — is to do it proactively at around half-full. Anthropic doesn't publish a specific threshold, but the Claude Code documentation does recommend /compact as a proactive tool, not a panic button. Compacting earlier costs less, preserves more relevant context, and resets the attention-drift clock.
A useful heuristic: if your last three turns each spent four-plus tool calls, you're already on the curve. Compact now, then continue.
3. Decompose to subagents before you hit the threshold. If the task ahead clearly needs another five tool calls of investigation, that's the time to hand it off to a subagent via the Task tool — not after you've already burned the budget. Subagents get fresh context, isolated state, and a tight return contract. The orchestrator pays the round-trip cost but keeps its own context clean. This is exactly the topology Anthropic used to ship the Research feature, and the topology that the three-topology decision matrix maps to "supervisor."
The signal to delegate isn't "this task is hard." It's "this task will need state that's irrelevant to anything else I'm doing." That state belongs in a subagent, not in your main loop.
The corollary nobody states out loud
The fifteen-call rule is half about the model and half about the operator. Even if Claude 5 doubles the safe sequence length, the discipline is identical — instrument, compact proactively, decompose. The threshold moves. The shape of the curve doesn't.
The mistake that produces my worst sessions has never been running too many tool calls. It's not noticing that I have. Without instrumentation, the only signal you're past the threshold is the agent quietly producing worse work — and you, mid-flow and trusting, quietly accepting it. That's the normalisation-of-deviance trap wearing the costume of productivity.
Which lands me back where most of these agent problems land: the bottleneck isn't the model's capability, it's whether the human driving it can still tell good output from plausible output at call twenty. Instrumenting the tool-call count is just the cheapest way I've found to keep that judgment intact — to know, before I accept the diff, whether the thing that produced it was still on the safe side of the cliff.
The Cutler.sg Newsletter
Weekly notes on AI, engineering leadership, and building in Singapore. No fluff.
The 30 Principles for Agentic Engineering — Part 3: The Harness
Principles 15–20. The harness configuration that keeps the kernel and lifecycle cheap: CLAUDE.md under 200 lines, hooks for real incidents, skills that auto-invoke, subagent isolation, pinning, and Stage 5 distribution.
Characterisation Tests Before Agents Touch Brownfield Code
Agents over-refactor stable code without a safety net. Feathers' characterisation-test technique — write tests for current behaviour before changing anything — is more important than ever. The agent itself is the perfect characterisation-test-writer.
Standardise the Harness, Customise the Work: The 5-Layer Agent Architecture
Three open-source extractions converged on the same five layers. The architecture isn't a vendor narrative — it's a discovered structure. Here's the decision rule that keeps you from over-engineering it.