Two Papers That Puncture the Hype
Two numbers get quoted at you constantly: the size of the context window and the amount of "thinking" a reasoning model does. A million tokens. Extended reasoning. More room, more thought, more capability — that's the pitch.
Two 2025 research papers took those two numbers apart. One is clean and largely uncontested. The other set off a fight that's worth understanding, because the fight is where the real lesson lives. Read both carefully and they point at exactly the same engineering response — the one that underpins the 15-tool-call rule I wrote about earlier.
Paper one: the context window is not free
Context Rot: How Increasing Input Tokens Impacts LLM Performance, published by Chroma's research team (Kelly Hong, Anton Troynikov, Jeff Huber) in July 2025, tested a simple assumption and found it false. The assumption, in their words:
"Large Language Models (LLMs) are typically presumed to process context uniformly — that is, the model should handle the 10,000th token just as reliably as the 100th. However, in practice, this assumption does not hold."
They ran 18 frontier models — Claude Opus 4 and Sonnet 4, GPT-4.1, o3, Gemini 2.5 Pro and Flash, the Qwen3 family — through deliberately simple tasks, holding difficulty constant and varying only input length. Every model degraded as the context grew. Not on hard tasks. On trivial ones.
The findings that should change how you work:
- Retrieval decays with length and ambiguity. A "needle" buried in a long context is found less reliably as the haystack grows, and the effect worsens when the question isn't a near-exact lexical match for the answer.
- A single distractor measurably hurts. One plausible-but-wrong passage drags performance down — and the effect compounds with length.
- Coherence can hurt. Counterintuitively, a logically structured haystack sometimes produced worse retrieval than a shuffled, incoherent one. The model gets distracted by the narrative.
- It shows up in conversation. On a long-memory QA task, models given the full ~113k-token history did markedly worse than models given a focused ~300-token slice. In one example, Claude Sonnet 4 answered "I cannot determine the number of days" — when the dates were sitting right there in its context.
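To make the "focused slice" idea concrete, here is a minimal sketch of picking a small, relevant subset of a conversation instead of passing the whole history. It is not Chroma's method. The function names, the lexical-overlap scoring, and the token budget are all assumptions for illustration, and given the first finding above, a real system would want something better than word overlap.

```python
import re

def tokens(text: str) -> list[str]:
    """Crude word tokenizer; a stand-in for a real model tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def focused_slice(history: list[str], question: str, budget: int) -> list[str]:
    """Pick the turns that overlap most with the question, within a token budget.

    Lexical overlap is a hypothetical scoring choice, used only for illustration.
    """
    q = set(tokens(question))
    scored = []
    for idx, turn in enumerate(history):
        overlap = len(q & set(tokens(turn)))
        if overlap:
            scored.append((overlap, idx))
    # Highest overlap first; earlier turns win ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))

    chosen, used = [], 0
    for _, idx in scored:
        cost = len(tokens(history[idx]))
        if used + cost <= budget:
            chosen.append(idx)
            used += cost
    # Return the chosen turns in their original conversational order.
    return [history[i] for i in sorted(chosen)]

history = [
    "We talked about lunch plans for Friday.",
    "The project kickoff was on 3 March.",
    "My cat knocked a plant off the shelf.",
    "The project launch happened on 17 March.",
]
question = "How many days between the project kickoff and launch?"
context = focused_slice(history, question, budget=20)
print(context)
```

With the budget set to 20, only the two turns that mention the project dates survive, which is the kind of tight working set the long-memory result argues for.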
There's a commercial disclosure worth making: Chroma sells a vector database, so they have an interest in arguing that giant context windows don't replace retrieval. But the code is public, the methodology is sound, and the result has been widely reproduced. The finding stands on its own: model performance degrades long before a bigger context window fills up.
Paper two: the thinking is not free either
The Illusion of Thinking, from Apple (Shojaee, Mirzadeh, Bengio, Farajtabar et al., June 2025), aimed at reasoning models — the o3-style, extended-thinking systems sold on their ability to "think harder." Apple tested them on scalable logic puzzles — Tower of Hanoi, Blocks World, River Crossing, Checker Jumping — where you can dial complexity up step by step. The headline findings:
- A three-regime structure: on low-complexity problems, plain models actually beat reasoning models; on medium complexity, reasoning models pull ahead; on high complexity, both collapse to near-zero accuracy.
- A genuinely strange result: as problems approached the collapse point, the models reduced their reasoning effort — spent fewer thinking tokens on harder problems — despite having budget left.
That second finding is the interesting one, and it's the one that survived what came next.
The fight — and what survives it
The Illusion of Thinking got hit hard, and fairly, on its methodology. The main rebuttal, The Illusion of the Illusion of Thinking (Lawsen, Open Philanthropy, June 2025), made three specific points:
- Tower of Hanoi at N=15 needs 32,767 moves (2¹⁵ − 1). Asking a model to enumerate every move exceeds its output token budget. That's a printing-paper limit, not a thinking limit.
- The evaluator couldn't tell the difference between "ran out of room to write the answer" and "couldn't reason it out."
- Some River Crossing instances were mathematically unsolvable at the tested sizes — the models were marked wrong for failing impossible problems.
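The first point is plain arithmetic, and it is worth running once. The sketch below computes the minimum move count, 2^n − 1, and compares the text needed to print every move against an output limit. The tokens-per-move figure and the output budget are assumptions chosen for illustration, not values taken from either paper.

```python
def hanoi_moves(n_disks: int) -> int:
    """Minimum number of moves for Tower of Hanoi with n disks: 2^n - 1."""
    return 2 ** n_disks - 1

# Both figures below are illustrative assumptions, not values from either paper.
TOKENS_PER_MOVE = 5      # assumed cost of writing one move as text
OUTPUT_BUDGET = 64_000   # assumed output token limit

for n in (8, 10, 12, 15):
    moves = hanoi_moves(n)
    needed = moves * TOKENS_PER_MOVE
    # Columns: disks, moves, tokens needed, fits within the assumed budget?
    print(n, moves, needed, needed <= OUTPUT_BUDGET)
```

Under these assumptions the 15-disk case needs 163,835 tokens just to write the answer down, well past the assumed limit, while 8 disks needs only 1,275.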
Lawsen showed that when you ask the model to write a function that generates the Hanoi solution rather than spell out every move, it succeeds on instances Apple had scored as total failures. A second paper, Rethinking the Illusion of Thinking (Varela et al., July 2025), did the careful version: it found the River Crossing failures were indeed mostly the unsolvable-instance artefact — "Once we limit tests strictly to solvable problems—LRMs effortlessly solve large instances involving over 100 agent pairs" — but that the Tower of Hanoi failures were not purely an output artefact: "LRMs still stumble when complexity rises moderately, around 8 disks."
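For a sense of how small "a function that generates the Hanoi solution" is, here is a minimal sketch in Python. It is my own illustration of the standard recursive algorithm, not the prompt or the output from Lawsen's paper, and the peg names and the `is_valid_solution` checker are assumptions added so the result can be verified.

```python
from typing import Iterable, Iterator

def hanoi(n: int, src: str = "A", dst: str = "C", via: str = "B") -> Iterator[tuple[str, str]]:
    """Yield the (from_peg, to_peg) moves that solve an n-disk puzzle."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, via, dst)   # park the n-1 smaller disks on the spare peg
    yield (src, dst)                         # move the largest disk to its destination
    yield from hanoi(n - 1, via, dst, src)   # bring the smaller disks back on top of it

def is_valid_solution(n: int, moves: Iterable[tuple[str, str]]) -> bool:
    """Replay the moves; reject illegal ones and require every disk to end on peg C."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for a, b in moves:
        if not pegs[a]:
            return False          # tried to move from an empty peg
        disk = pegs[a].pop()
        if pegs[b] and pegs[b][-1] < disk:
            return False          # larger disk placed on a smaller one
        pegs[b].append(disk)
    return pegs["C"] == list(range(n, 0, -1))

moves = list(hanoi(15))
print(len(moves), is_valid_solution(15, moves))
```

A handful of lines encodes all 32,767 moves for the 15-disk case. That gap between describing a solution and enumerating it is the distinction the rebuttal says the original evaluator could not see.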
So here's the honest scorecard. The strong reading — "reasoning models can't actually reason" — does not survive; a chunk of the collapse was experimental error. But the core does: there is a real compositional-complexity ceiling, the three-regime structure holds, and the "effort decreases as you approach the wall" behaviour was never refuted. Reasoning models hit a wall. The wall is lower than the hype implies and higher than the headline claimed.
This is what good science looks like from the outside — a strong claim, a sharp rebuttal, and a more durable claim left standing in the middle. Cite the middle.
Both papers point the same way
Strip each paper to its engineering consequence and they converge:
Context Rot says: don't pour everything into the window — reliability falls long before you hit the token limit, so retrieve narrowly and keep the working set tight. The Illusion of Thinking says: don't just turn up the thinking dial — there's a complexity ceiling, and past it the model quietly gives up rather than grinding harder.
Both are arguments against the same instinct: the belief that more — more context, more reasoning, more autonomy — is a free lunch that scales smoothly. It doesn't. Capability is bounded on both axes, and the bounds arrive earlier than the spec sheet suggests.
The response is the same one I keep arriving at from every direction. Decompose the work into pieces small enough to stay inside the reliable zone. Compress context proactively rather than letting it balloon. Verify each unit before moving on. Hand long-horizon work to sub-agents with fresh context rather than one ever-growing conversation. The 15-tool-call rule is the field-observed version of what these two papers measured in the lab: quality falls off a cliff when you let any single unit of work get too big.
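As a rough sketch of what that looks like in a harness, the loop below caps each unit of work, verifies it, and hands only a summary to the next unit. Everything here is hypothetical: the `WorkUnit` type, the `summarize` and `verify` stubs, and reading the 15-tool-call rule as a hard cap of 15 steps per unit are my assumptions for illustration, not an existing API.

```python
from dataclasses import dataclass, field

MAX_TOOL_CALLS = 15  # assumed reading of the 15-tool-call rule: a hard cap per unit

@dataclass
class WorkUnit:
    """One bounded unit of work that starts from a summary, not a full transcript."""
    carried_summary: str
    steps: list[str] = field(default_factory=list)

def summarize(unit: WorkUnit) -> str:
    """Hypothetical compression step; a real one would ask a model for a summary."""
    return f"{unit.carried_summary} +{len(unit.steps)} steps done"

def verify(unit: WorkUnit) -> bool:
    """Hypothetical check; a real one would run tests or an independent review."""
    return len(unit.steps) <= MAX_TOOL_CALLS

def run(steps: list[str]) -> list[WorkUnit]:
    """Split a long task into units of at most MAX_TOOL_CALLS steps each."""
    units = [WorkUnit(carried_summary="start")]
    for step in steps:
        current = units[-1]
        if len(current.steps) == MAX_TOOL_CALLS:
            if not verify(current):
                raise RuntimeError("unit failed verification")
            # Hand off: the next unit sees only the summary, never the full history.
            current = WorkUnit(carried_summary=summarize(current))
            units.append(current)
        current.steps.append(step)
    if not verify(units[-1]):
        raise RuntimeError("final unit failed verification")
    return units

units = run([f"tool_call_{i}" for i in range(40)])
print([len(u.steps) for u in units])
```

Forty steps come out as three units of 15, 15, and 10, each starting from a one-line summary rather than an ever-growing conversation.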
The takeaway
The hype sells two dials marked "more". The research says both dials have a wall, both walls are closer than advertised, and the engineering that works is the engineering that respects them. None of this means the models are weak; they're extraordinary inside their reliable zone. It means the job is knowing where that zone ends, and building so each piece of work stays inside it.
Read the papers, not the press releases — and when a paper gets rebutted, read the rebuttal too. The durable claim is almost always smaller than the headline and more useful than the hype.