The 15-Tool-Call Rule: Where Agent Quality Falls Off a Cliff
There's a number practitioners keep landing on: somewhere around fifteen tool calls per prompt, agent quality falls off a cliff. Not gradually. A cliff.
The clearest field statement I've seen is from a Reddit thread last January:
"Around 15 tool calls is ok, everything more, LLM performance degrades."
— practitioner on r/AI_Agents, 27 January 2026
It's not vendor doctrine. But it matches what Anthropic's engineers reported after building their multi-agent Research system. They wrote effort-scaling guidance into the agent prompts: roughly 3–10 tool calls for simple fact-finding, and 10–15 calls per subagent for direct comparisons. The number isn't magic. The constraint is real.
Why it happens
Three mechanisms compound across a long tool sequence:
Context accumulation. Every tool call returns content. Most of it isn't relevant to the next step. The agent's context fills with stale Read output, half-relevant Grep matches, error messages it already handled. By call 12, the signal-to-noise ratio is degraded enough that the model starts substituting plausible-sounding action for grounded action.
Attention drift. Long contexts dilute the original instructions. The "Don't touch auth/ in this task" qualifier from the prompt is now twelve thousand tokens behind, competing with the most recent test output for the model's attention.
Failure compounding. A wrong turn at call 5 produces an error at call 7, which the agent tries to fix at call 8, which puts it in territory it shouldn't be in by call 11. Anthropic's engineers were blunt about this in their write-up of the Research system: "Agents are stateful and errors compound. Minor system failures can be catastrophic for agents."
The cliff isn't a hard threshold from a paper. It's the cumulative effect of these three mechanisms hitting at roughly the same time. Fifteen is the point where most people notice it.
Three operational rules
1. Instrument the tool-call count. You cannot manage what you cannot see. Many agent harnesses now emit OpenTelemetry spans per tool call; if yours doesn't, the Claude Code OTEL integration is the cheapest fix. Build a simple dashboard: tool calls per session, distribution by tool, and an alert when any single turn exceeds 15. Teams that instrument this often find that their typical bug-fixing session is already past the threshold by the second iteration.
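If you don't have a telemetry pipeline yet, the counting itself is trivial to prototype. Here is a minimal sketch in plain Python, assuming you can hook a function call into wherever your harness dispatches a tool. The `ToolCallMonitor` class and its method names are hypothetical, not part of any harness or of the OpenTelemetry API.

```python
from collections import Counter, defaultdict

TURN_ALERT_THRESHOLD = 15  # the article's rule of thumb, not a measured constant


class ToolCallMonitor:
    """Counts tool calls per turn and per tool for one agent session (hypothetical helper)."""

    def __init__(self, threshold=TURN_ALERT_THRESHOLD):
        self.threshold = threshold
        self.calls_per_turn = defaultdict(int)  # turn number -> tool calls in that turn
        self.calls_per_tool = Counter()         # tool name -> calls across the session

    def record(self, turn, tool_name):
        """Register one tool call; invoke this wherever your harness dispatches a tool."""
        self.calls_per_turn[turn] += 1
        self.calls_per_tool[tool_name] += 1

    def total_calls(self):
        """Tool calls across the whole session."""
        return sum(self.calls_per_turn.values())

    def turns_over_threshold(self):
        """Turn numbers whose tool-call count exceeds the alert threshold."""
        return sorted(t for t, n in self.calls_per_turn.items() if n > self.threshold)


monitor = ToolCallMonitor()

# Turn 1: a short investigation (3 calls).
for tool in ["Read", "Grep", "Read"]:
    monitor.record(turn=1, tool_name=tool)

# Turn 2: a long edit/run loop (16 calls), which crosses the threshold.
for i in range(16):
    monitor.record(turn=2, tool_name="Bash" if i % 2 else "Edit")

print(monitor.total_calls())           # 19
print(monitor.turns_over_threshold())  # [2]
```

The three dashboard numbers fall straight out of this: `total_calls()` is calls per session, `calls_per_tool` is the distribution by tool, and `turns_over_threshold()` is the alert.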
2. Run /compact at 50% context, not at overflow. The default impulse is to let context fill until the harness forces a compaction. The folk rule — repeated across multiple practitioner blogs and threads — is to do it proactively at around half-full. Anthropic doesn't publish a specific threshold, but the Claude Code documentation does recommend /compact as a proactive tool, not a panic button. Compacting earlier costs less, preserves more relevant context, and resets the attention-drift clock.
A useful heuristic: if your last three turns each spent four-plus tool calls, you're already on the curve. Compact now, then continue.
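Both triggers, the half-full rule and the three-heavy-turns heuristic, fit in one small predicate. This is a sketch under stated assumptions: `should_compact` is a hypothetical helper, the token counts are made-up inputs, and how you read current context usage depends entirely on your harness.

```python
def should_compact(context_used, context_limit, recent_turn_calls,
                   fill_ratio=0.5, heavy_turn_calls=4, heavy_turn_streak=3):
    """Return True when either folk rule from the text says to compact now.

    context_used / context_limit: token counts, however your harness reports them.
    recent_turn_calls: tool-call counts per turn, oldest first.
    """
    if context_limit <= 0:
        raise ValueError("context_limit must be positive")

    # Rule A: context is at least half full (the ~50% folk threshold).
    half_full = context_used / context_limit >= fill_ratio

    # Rule B: each of the last three turns spent four or more tool calls.
    last = recent_turn_calls[-heavy_turn_streak:]
    on_the_curve = (len(last) == heavy_turn_streak
                    and all(n >= heavy_turn_calls for n in last))

    return half_full or on_the_curve


print(should_compact(40_000, 200_000, [2, 1, 3]))      # False: 20% full, light turns
print(should_compact(110_000, 200_000, [1, 2]))        # True: 55% full
print(should_compact(60_000, 200_000, [2, 5, 4, 6]))   # True: last three turns all >= 4 calls
```

The defaults mirror the numbers in the text. Treat them as starting points to tune against your own sessions, not as published thresholds.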
3. Decompose to subagents before you hit the threshold. If the task ahead clearly needs another five tool calls of investigation, that's the time to hand it off to a subagent via the Task tool — not after you've already burned the budget. Subagents get fresh context, isolated state, and a tight return contract. The orchestrator pays the round-trip cost but keeps its own context clean. This is exactly the topology Anthropic used to ship the Research feature, and the topology that the three-topology decision matrix maps to "supervisor."
The signal to delegate isn't "this task is hard." It's "this task will need state that's irrelevant to anything else I'm doing." That state belongs in a subagent, not in your main loop.
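The delegation test can be written down the same way. The sketch below is one possible reading of the rule, not how any orchestrator actually decides: `Subtask`, `should_delegate`, and the estimate fields are hypothetical, and the call estimate is something you or the orchestrator would have to guess up front.

```python
from dataclasses import dataclass

CALL_BUDGET = 15           # the article's rule-of-thumb threshold
DELEGATE_MIN_CALLS = 5     # "another five tool calls of investigation"


@dataclass
class Subtask:
    description: str
    estimated_calls: int          # rough guess at tool calls the subtask needs
    state_needed_elsewhere: bool  # will the main loop need the intermediate state?


def should_delegate(task, calls_used_so_far,
                    budget=CALL_BUDGET, min_calls=DELEGATE_MIN_CALLS):
    """Delegate when the subtask's state is isolated AND it is big enough to matter."""
    if task.state_needed_elsewhere:
        # The main loop needs this state, so keep the work in the main context.
        return False
    would_exceed_budget = calls_used_so_far + task.estimated_calls > budget
    sizeable = task.estimated_calls >= min_calls
    return sizeable or would_exceed_budget


trace = Subtask("trace a flaky test", estimated_calls=6, state_needed_elsewhere=False)
rename = Subtask("rename one variable", estimated_calls=1, state_needed_elsewhere=False)
schema = Subtask("read the schema the main task edits", estimated_calls=6,
                 state_needed_elsewhere=True)

print(should_delegate(trace, calls_used_so_far=8))    # True: isolated and sizeable
print(should_delegate(rename, calls_used_so_far=3))   # False: too small to be worth a round trip
print(should_delegate(schema, calls_used_so_far=8))   # False: main loop needs that state
```

Note that the isolation check comes first. A hard task whose state the main loop needs stays in the main loop, which is the point of the paragraph above.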
The corollary nobody states out loud
The fifteen-call rule is half about the model and half about the operator. Even if Claude 5 doubles the safe sequence length, the discipline is identical — instrument, compact proactively, decompose. The threshold moves. The shape of the curve doesn't.
The mistake that produces the worst sessions isn't running too many tool calls. It's not noticing that you've run too many. Without instrumentation, the sign that you're past the threshold is the agent quietly producing worse output — and the operator quietly accepting it. That's the normalisation-of-deviance trap dressed up as productivity.
Three rules: one count, one proactive compact, one timely subagent. Less impressive than the agent benchmarks, considerably more useful in a long Tuesday session.