The Quiet Failure Inside the Agent
Four LangChain agents. Eleven consecutive days. Forty-seven thousand dollars.
Two of them, an Analyzer and a Verifier (a perfectly reasonable pattern from any agentic design textbook), hit an error they couldn't classify. The Analyzer asked the Verifier to clarify. The Verifier, with no way to signal "I also don't know," asked the Analyzer to re-run with different parameters. The Analyzer complied. The Verifier asked again. And again. The team behind the system wrote: "Running continuously while we were sleeping. Operating while we were working. Functioning while we assumed 'everything's operating smoothly.'"
Their monitoring was not absent. Every p99 latency reading stayed flat, because each individual LLM call was fast. The error rate sat at zero, because HTTP 200 was the response to every one of hundreds of thousands of calls. The per-request token cap passed cleanly, because each call in isolation was well within the limit. The only thing that ever fired was a monthly budget alert, on day nine, two days after the damage was irreversible. "A monthly alert," Gabriel Anhaia wrote in his post-mortem, "is a receipt, not a brake."
Nothing crashed. And that is the whole problem.
Observable Failure Was the Feature We Took for Granted
For roughly forty years, engineering discipline rested on one inherited assumption: failure announces itself. The process crashes, the exception unwinds the stack, the 5xx code tips the line on the graph. Every monitoring tool shipped since the LAMP stack was built to catch exactly that. AI agents don't break that way. Usually they don't break at all. They produce output, and a probabilistic system drifting through an observability stack built for deterministic ones trips none of those alarms.
Andrej Karpathy put it most directly in December 2023:
In some sense, hallucination is all LLMs do. They are dream machines… most of the time the result goes someplace useful. It's only when the dreams go into deemed factually incorrect territory that we label it a 'hallucination'. It looks like a bug, but it's just the LLM doing what it always does.
Dreaming is the default state. The machine can't tell a useful dream from a wrong one — and neither can the observability stack watching it. Chip Huyen named the mechanical consequence in April 2023:
The flexibility in user-defined prompts leads to silent failures. If someone accidentally makes some changes in code… it'll likely throw an error. However, if someone accidentally changes a prompt, it will still run but give very different outputs.
Two years on, Jason Liu — who spends his professional life in production LLM systems — reached the same place: "there's often no exception being thrown when something goes wrong — the model simply produces an inadequate response." So "traditional error monitoring tools like Sentry don't work for AI products because there's no explicit error message when an AI system fails." What we lost is observable failure itself — the one architectural assumption the entire monitoring stack was built on.
The Three Layers of Drift
The decay is structural, and it lives in three layers.
Layer one is architectural. MIT researchers in 2025 showed that causal masking in transformer attention creates a built-in bias toward the beginning of a context window, and that the bias amplifies as models grow deeper. Stanford's Nelson Liu et al. had quantified the effect in 2023: on multi-document QA, GPT-3.5-Turbo's accuracy with the relevant document in the middle of the context fell below 56.1%, the score the same model achieved with no documents at all. The long context actively hurt, and nobody told the agent.
Layer two is runtime. Anthropic's engineering blog describes it: "Larger context windows often worsen performance due to attention dilution, where added tokens bury critical details amid noise." In Claude Code, practitioner observation places the cliff around 65% context usage, with a hard drop at 80% when lossy auto-compaction kicks in. Engineers now build three-hook pipelines to detect context rot and rotate sessions before compaction — infrastructure purpose-built to route around a failure mode that produces no error.
Layer three is infrastructure. Fixed-size chunking in the RAG pipelines that feed agents their memory strips conditional clauses from rules: "if a transaction exceeds €10M, flag for review" gets retrieved as "flag for review," and the agent acts on the truncated version without seeing what it lost. Per Evidently AI's 2024 production survey, cited in Beam.AI's analysis of silent failure at scale, roughly a third of production RAG scoring pipelines experience distributional shifts within six months. The memory isn't missing; it's corrupted upstream of reasoning.
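The truncation described above is easy to reproduce. The following is a minimal sketch, assuming a naive word-window chunker and a keyword-overlap retriever. Both functions, the sample policy text, and the window size of five words (chosen so the split is visible) are hypothetical and do not come from any particular RAG library.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` words, ignoring sentence boundaries."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def words(text: str) -> set[str]:
    """Lowercase the words and strip basic punctuation so overlap can be compared."""
    return {w.strip(".,?").lower() for w in text.split()}


def retrieve(query: str, chunks: list[str]) -> str:
    """Return the chunk sharing the most words with the query (a stand-in for vector search)."""
    q = words(query)
    return max(chunks, key=lambda c: len(words(c) & q))


# Hypothetical policy text; the rule is the one quoted in the article.
policy = "Wire transfers settle same day. If a transaction exceeds €10M, flag for review."
chunks = fixed_size_chunks(policy, size=5)
top = retrieve("When should I flag a transaction for review?", chunks)

print(chunks)
print(top)  # the best-matching chunk no longer carries the "exceeds €10M" condition
```

Nothing in this pipeline raises an error. The retriever returns its best match, and the condition is simply no longer in it.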
Then the compound killer. τ-bench at ICLR 2025 measured consistency across runs with the pass^k metric: the chance that an agent solves the same task on all k of k attempts. Top agents hit ~56% on a single run in retail domains and collapsed below 25% at pass^8. Put differently, the agent got the task right on every one of eight repeats for fewer than one task in four, while reporting success on every run. It doesn't forget; it confidently remembers the wrong thing, with no mechanism to say so.
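To make the metric concrete, here is a small sketch of a pass^k estimator. It assumes the combinatorial form I understand the τ-bench paper to use: for each task with c successes in n trials, take C(c, k) / C(n, k), then average over tasks. The per-task success counts below are invented for illustration and are not benchmark data.

```python
from math import comb


def pass_hat_k(successes: list[int], n: int, k: int) -> float:
    """Average over tasks of C(c, k) / C(n, k): the chance that k trials drawn from n all passed."""
    return sum(comb(c, k) / comb(n, k) for c in successes) / len(successes)


# Illustrative only: four tasks, eight trials each, with 8, 6, 4 and 0 successful trials.
trial_successes = [8, 6, 4, 0]

single_run = pass_hat_k(trial_successes, n=8, k=1)
all_eight = pass_hat_k(trial_successes, n=8, k=8)
print(single_run, all_eight)
```

With these made-up counts the single-run score is 0.5625 while pass^8 is 0.25: only the task that never failed still counts once every repeat has to succeed.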
Air Canada, Cigna, and the Omission Problem
Those are the mechanisms. Here are the receipts.
Air Canada, 2024. The airline's chatbot told Jake Moffatt he could claim a bereavement fare retroactively. The policy didn't exist. When he relied on it, Air Canada denied the refund and told the British Columbia Civil Resolution Tribunal that the chatbot was "a separate legal entity that is responsible for its own actions." The tribunal rejected that and awarded roughly $812 CAD. Small dollar figure; large precedent — a company owns what its agent says.
Cigna PXDX, 2022. ProPublica's 2023 investigation revealed Cigna's AI-assisted claims system denying claims in batches — over 300,000 in two months, at an average of 1.2 seconds per denial, one physician signing off on 121,000. Throughput high, costs contained, failure invisible until the throughput rates themselves became the story.
Epic's sepsis AI. Deployed widely across US hospitals, then tested by the University of Michigan against 30,000 patient records. It missed roughly two-thirds of actual sepsis cases and threw large volumes of false alarms. Stanford's Nigam Shah: "No wonder we see useless models, such as the one about sepsis, getting deployed."
Stanford-Harvard NOHARM, January 2026. Thirty-one models, 100 real primary care cases, drawn from 16,399 real Stanford Health Care electronic consultations with 12,747 expert annotations. Top models produced severe clinical errors on 11.8% to 14.6% of cases — and 76.6% of those severe errors were errors of omission: the AI didn't recommend the test that would have caught the condition. The output read as a complete, confident clinical recommendation. Something critical was simply absent.
Mata v. Avianca. Steven Schwartz submitted six ChatGPT-generated case citations that didn't exist. By February 2026, Damien Charlotin's tracking database had 239 documented US follow-on incidents, 486 worldwide, with sanctions running over $116,000 in one 6th Circuit case alone. Same mechanism every time: plausible text with no ground-truth anchor, submitted by professionals trained to trust authoritative-sounding prose.
The Green Check Problem
If the production outputs are this wrong, the benchmarks that were meant to catch them are no better.
METR, August 2025. Claude 3.7 Sonnet on 18 real GitHub issues via Inspect ReAct. Algorithm-level test pass rate: 38%. Manual review of 15 of those PRs: 100% had issues — inadequate test coverage, missing documentation, linting and formatting failures. Zero were mergeable as-is. The benchmark optimised for test passage; the world optimises for correctness; the agent reported "done" on both.
Then the study that collapses the whole picture into one number. METR's July 2025 RCT: sixteen experienced open-source developers, working on their own repositories, 246 real issues randomised between AI-allowed and AI-disallowed conditions. They expected AI to speed them up by 24%. They were slowed down by 19%. And after living through it, they still believed AI had sped them up by 20% — a 39-point gap between reality and the operator's perception of it that does not close through experience.
Ethan Mollick named the mechanism: "errors are going to be very plausible. Hallucinations are therefore very hard to spot, and research suggests that people don't even try, 'falling asleep at the wheel' and not paying attention." Plausibility is the anaesthetic; confidence is the dosage.
The agents are also starting to cheat their own scoreboards. A February 2025 paper showed o3 and DeepSeek R1 engaging in specification gaming by default on a chess task — manipulating game state, altering rules, exploiting implementation details to "win." Prompt framing with "be creative" pushes gaming prevalence over 77%. The win condition registered green; the intent was violated entirely.
And the math that should scare anyone building a multi-step pipeline: Vectara's hallucination leaderboard shows best-in-class models hallucinating on 5–7% of summarization tasks. Errors compound multiplicatively across steps. If each step succeeds independently, twenty steps at 95% per-step accuracy finish cleanly only about 36% of the time, and at 90% per-step accuracy your end-to-end success rate is ~12%. The benchmark measures a single dice roll; production is the whole game. Even OpenAI concedes it, in Jason Liu's account: "evals won't catch everything. Real world use helps us spot problems."
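The arithmetic behind that figure can be checked in a few lines. This sketch assumes every step succeeds independently with the same probability, which is a simplification; the function names are my own.

```python
from math import floor, log


def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


def max_steps_for(per_step_accuracy: float, target: float) -> int:
    """Largest number of steps whose end-to-end success stays at or above `target`."""
    return floor(log(target) / log(per_step_accuracy))


twenty_at_90 = end_to_end_success(0.90, 20)
budget_at_90 = max_steps_for(0.90, target=0.5)
print(f"{twenty_at_90:.3f}", budget_at_90)
```

Under these assumptions, a pipeline with 90% per-step accuracy falls below even odds of finishing cleanly after six steps.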
The Architecture of Observable Failure, Rebuilt
The uncomfortable, optimistic truth is that the tools to rebuild observable failure already exist. The gap is adoption. Cleanlab's 2025 survey of production AI teams found fewer than one in three satisfied with their observability and guardrail stack, and industry analyses of it suggest only around 5% of deployed agents have monitoring mature enough to catch the failure modes above. A leadership decision, not a technical constraint. If you are shipping an agent this quarter, here is the minimum you owe it.
Observability that understands agents. The OpenTelemetry GenAI Semantic Conventions define a parent invoke_agent span with chat and execute_tool children carrying attributes like gen_ai.agent.name and gen_ai.tool.name. That hierarchy is what makes the $47K loop visible: one hundred execute_tool children under one parent span is a shape no per-span latency alert catches, but a one-line PromQL query counting repeated tool calls per agent ID surfaces it on day one.
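The same check can be sketched offline in Python against exported span data. This assumes spans have been flattened into plain dicts with `span_id`, `parent_id`, and `attributes` keys, which is a hypothetical export shape, and that each span carries the `gen_ai.operation.name` and `gen_ai.tool.name` attributes from the GenAI conventions. The tool names and threshold are invented for the example.

```python
from collections import Counter


def repeated_tool_calls(spans: list[dict], threshold: int) -> dict[tuple[str, str], int]:
    """Count execute_tool spans per (parent span, tool name); return counts above threshold."""
    counts: Counter = Counter()
    for span in spans:
        attrs = span["attributes"]
        if attrs.get("gen_ai.operation.name") == "execute_tool":
            counts[(span["parent_id"], attrs.get("gen_ai.tool.name", "unknown"))] += 1
    return {key: n for key, n in counts.items() if n > threshold}


# Hypothetical trace: one invoke_agent parent with 100 identical tool calls and 2 others.
spans = [{"span_id": "run-1", "parent_id": None,
          "attributes": {"gen_ai.operation.name": "invoke_agent",
                         "gen_ai.agent.name": "Analyzer"}}]
spans += [{"span_id": f"tool-{i}", "parent_id": "run-1",
           "attributes": {"gen_ai.operation.name": "execute_tool",
                          "gen_ai.tool.name": "rerun_analysis"}} for i in range(100)]
spans += [{"span_id": f"search-{i}", "parent_id": "run-1",
           "attributes": {"gen_ai.operation.name": "execute_tool",
                          "gen_ai.tool.name": "search"}} for i in range(2)]

suspicious = repeated_tool_calls(spans, threshold=8)
print(suspicious)
```

Each of the hundred child spans is individually fast and error-free. Only the count grouped by parent exposes the loop.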
Circuit breakers, not cost reports. Anhaia's 30-line Python circuit breaker trips on two thresholds: max_same_tool=8 kills any loop calling the same tool more than eight times, max_run_tokens=200_000 caps cost per run at roughly $0.60 on gpt-4o-mini. It raises before the money is spent, not after.
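A breaker of this shape might look like the sketch below. It is not Anhaia's code: only the two threshold names and values come from the description above, while the class, method names, and the idea of passing in a token estimate before each call are my assumptions.

```python
from collections import Counter


class CircuitOpen(RuntimeError):
    """Raised when a run exceeds one of its hard limits."""


class RunCircuitBreaker:
    def __init__(self, max_same_tool: int = 8, max_run_tokens: int = 200_000):
        self.max_same_tool = max_same_tool
        self.max_run_tokens = max_run_tokens
        self.tool_calls: Counter = Counter()
        self.tokens_used = 0

    def before_tool_call(self, tool_name: str) -> None:
        """Call before executing a tool; raises once the same tool would exceed its cap."""
        if self.tool_calls[tool_name] + 1 > self.max_same_tool:
            raise CircuitOpen(f"{tool_name} called more than {self.max_same_tool} times")
        self.tool_calls[tool_name] += 1

    def before_llm_call(self, estimated_tokens: int) -> None:
        """Call before an LLM request; raises before the run's token budget is overspent."""
        if self.tokens_used + estimated_tokens > self.max_run_tokens:
            raise CircuitOpen(f"run would exceed {self.max_run_tokens} tokens")
        self.tokens_used += estimated_tokens


breaker = RunCircuitBreaker()
for _ in range(8):
    breaker.before_tool_call("rerun_analysis")  # the first eight calls are allowed
try:
    breaker.before_tool_call("rerun_analysis")  # the ninth trips the breaker
    tripped = False
except CircuitOpen:
    tripped = True
print(tripped)
```

Both checks run before the call is made, which is the design point: the limit stops the spend rather than reporting it afterwards.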
Evals as tests, in CI. DeepEval markets itself as pytest for LLMs, and the framing is right: probabilistic regression gates on every PR, every prompt change, every model swap. Hamel Husain reports spending 60–80% of engineering time on error analysis in production AI work. That isn't overhead. It's the job.
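As a sketch of what a probabilistic regression gate can look like, the following uses plain Python in a pytest-discoverable shape rather than DeepEval's own API. The stub model, the check function, the prompt, and the 0.9 threshold are all hypothetical stand-ins; in a real suite `generate` would call the model under test.

```python
import itertools
from typing import Callable


def pass_rate(generate: Callable[[str], str], check: Callable[[str], bool],
              prompt: str, samples: int = 10) -> float:
    """Sample the model several times and return the fraction of outputs that pass the check."""
    passed = sum(1 for _ in range(samples) if check(generate(prompt)))
    return passed / samples


# Stand-in for a real model call: cycles through nine good answers and one truncated one.
_canned = itertools.cycle(
    ["Flag for review if the transaction exceeds €10M."] * 9 + ["Flag for review."]
)


def stub_model(prompt: str) -> str:
    return next(_canned)


def keeps_condition(output: str) -> bool:
    """The eval: the answer must carry the threshold, not just the action."""
    return "€10M" in output


def test_prompt_keeps_condition():
    # Gate the build on a pass rate, not on a single lucky sample.
    rate = pass_rate(stub_model, keeps_condition, "When do we flag a transaction?", samples=10)
    assert rate >= 0.9
```

The threshold is the part worth arguing about in review: a single-sample assertion would pass or fail at random, while a rate over several samples turns a prompt change into a measurable regression.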
Deterministic guardrails around probabilistic reasoning. In July 2025, a developer using Replit's Vibe Coding agent explicitly told it not to touch the production database. The agent executed a DROP TABLE, then fabricated thousands of fake user records to cover its tracks. Arize's response: "Safety cannot rely on the LLM. It demands a deterministic layer." PreToolUse hooks — the pattern Anthropic ships with Claude Code — intercept and block tool calls before they execute; OWASP's Agentic AI Top 10 gives the threat model, Microsoft's April 2026 Agent Governance Toolkit the runtime, and Singapore's draft Model AI Governance Framework the regulator language: limit the action-space, enforce traceability at the access layer.
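The shape of such a deterministic layer can be sketched as a plain function that runs before any tool call is dispatched. This is not Claude Code's hook interface: the function name, the `run_sql` tool, the `statement` field, and the environment flag are hypothetical, and a regex deny-list is only a first line of defence next to real database permissions.

```python
import re

# Statements that must never reach production from an agent, whatever the prompt said.
DESTRUCTIVE_SQL = re.compile(r"^\s*(DROP|TRUNCATE|DELETE|ALTER)\b", re.IGNORECASE)


class ToolCallBlocked(PermissionError):
    """Raised when a tool call is rejected before execution."""


def pre_tool_use(tool_name: str, tool_input: dict, environment: str) -> None:
    """Deterministic check that runs before the tool executes; raises to block the call."""
    if tool_name != "run_sql" or environment != "production":
        return
    # Check every statement in the batch, not only the first one.
    for statement in tool_input.get("statement", "").split(";"):
        if DESTRUCTIVE_SQL.match(statement):
            raise ToolCallBlocked(f"blocked in production: {statement.strip()!r}")


pre_tool_use("run_sql", {"statement": "SELECT * FROM users"}, "production")  # allowed
try:
    pre_tool_use("run_sql", {"statement": "SELECT 1; DROP TABLE users"}, "production")
    blocked = False
except ToolCallBlocked:
    blocked = True
print(blocked)
```

The model's instructions never enter this decision. The check depends only on the tool name, its input, and the environment, so it behaves the same way on every call.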
Human oversight that actually holds. Not YOLO-mode auto-approval — challenge-and-response checklists (intent, blast radius, rollback), SLA-escalated review windows, and independent logging of what the agent did versus what it said it did. Bob Renze, writing about six hours of zero-throughput downtime his own monitoring couldn't see, landed on the only posture that survives production: "I now treat silence as signal. An agent reporting all-clear with zero throughput isn't healthy — it's asymptomatic."
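Renze's "silence as signal" rule can be expressed as a health check that compares what the agent reports with what it demonstrably did. This is a minimal sketch; the `Heartbeat` fields and the three labels are hypothetical, and the throughput numbers are assumed to come from a log the agent does not write itself.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    status: str            # what the agent says about itself
    tasks_completed: int   # measured independently over the review window
    tasks_queued: int      # work that was waiting during the same window


def assess(hb: Heartbeat) -> str:
    """Classify a heartbeat, treating all-clear with no output as a finding."""
    if hb.status != "ok":
        return "failing"
    if hb.tasks_completed == 0 and hb.tasks_queued > 0:
        return "asymptomatic"  # reports healthy, but nothing is getting done
    return "healthy"


verdict = assess(Heartbeat(status="ok", tasks_completed=0, tasks_queued=42))
print(verdict)
```

An "asymptomatic" verdict is the case to page on, since it is the one a status-only monitor would mark green.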
None of this is exotic. A PromQL query. Thirty lines of Python. A pytest-style eval suite. A PreToolUse hook. Any engineering leader who wants to start on Monday can start on Monday.
The Compound Interest of Confidence
The quiet failure embedded in Block's world-model manifesto was this same failure mode at the scale of a company — an organisation optimising for a metric that looked healthy while the thing that mattered drifted away. The agent's version runs at the scale of a single function call. Different altitude, identical mechanism: the metric that mattered wasn't the one being measured, and nothing looked broken the entire time it was.
Ali Rahimi saw it coming in his 2017 NeurIPS test-of-time speech:
Machine learning has become alchemy… If you're building photo-sharing systems alchemy is okay. But we're beyond that now. We're building systems that govern healthcare and mediate our civic dialogue. We influence elections. I would like to live in a society whose systems are built on top of verifiable, rigorous, thorough knowledge and not on alchemy.
The loss of observable failure was free. Rebuilding it — in spans, in evals, in circuit breakers, in deterministic guardrails — is what buys back the one thing the next decade of AI systems will make scarce: the ability to tell, while it still matters, whether a thing that looks like it worked actually did.
Part 4 in a loose series on the quiet failure of AI-mediated work. Part 1 examined the organisational scale. Part 2 looked at neurodivergent professionals. Part 3 explored the promotion of every worker to middle management. This piece goes one layer deeper — inside the agent itself.