The Productivity J-Curve: Why Your AI Pilot Looks Worst at Week 6
The 39-Point Reversal
Sixteen experienced developers. Two hundred and forty-six real tasks on mature open-source repos they'd contributed to for years. Cursor Pro, Claude 3.5 and 3.7 Sonnet. Randomised between AI-allowed and AI-forbidden, February to June 2025. Published 10 July as arXiv:2507.09089 by Becker, Rush, Barnes and Rein at METR.
The headline finding is the cleanest sign reversal in the field:
"After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19% — AI tooling slowed developers down."
Plus twenty perceived. Minus nineteen measured. The confidence interval was +2% to +39% — a real slowdown, just not a precise one. Both expert forecasters and the developers themselves had predicted a speed-up north of 24%. They got the sign wrong.
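The arithmetic behind the reversal is worth spelling out, because a change in completion time is not the same thing as a change in speed. The sketch below uses only the two percentages quoted above; the function name and the conversion to a throughput multiplier are my own illustration, not something taken from the METR paper.

```python
# Completion-time changes quoted from the study (negative means faster).
perceived_time_change = -0.20   # developers' estimate after the study
measured_time_change = 0.19     # what the randomised measurement found


def speed_multiplier(time_change: float) -> float:
    """Convert a fractional change in completion time into a throughput multiplier."""
    return 1.0 / (1.0 + time_change)


# Gap between belief and measurement, in percentage points of completion time.
gap_points = round((measured_time_change - perceived_time_change) * 100)

perceived_speed = speed_multiplier(perceived_time_change)
measured_speed = speed_multiplier(measured_time_change)

print(f"gap: {gap_points} points")
print(f"perceived: {perceived_speed:.2f}x, measured: {measured_speed:.2f}x")
```

Read as throughput, the developers believed they were working at roughly 1.25× while the measurement put them at roughly 0.84×.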
That is not a tooling story. It is a measurement story. The gap becomes visible the moment a week-6 pilot report lands on a CTO's desk with positive developer sentiment and flat delivery metrics. The tool has not failed. The measurement window has, and that failure isn't a quirk of one study. It's the shape of every general-purpose technology, and it has a name.
The Shape Has a Name
In 2006, the political scientist Ian Bremmer published The J Curve to explain why authoritarian states become more unstable, not less, when they begin to open up. Stability falls before it rises. The right-arm peak is higher than the start — but only if you survive the bottom.
Twelve years later, the economists Erik Brynjolfsson, Daniel Rock and Chad Syverson lifted the same shape into macroeconomics. Their 2018 Productivity J-Curve paper showed that general-purpose technologies — electricity, the PC, AI — require large intangible investments in skills, processes and data redesign before the gains show up in the statistics. Measured output falls, then rises higher than before. Same shape, longer time constant.
Then, in July 2025, Kristina McElheran and co-authors put numbers on it for AI specifically: US Census Bureau data across tens of thousands of manufacturing firms showed an average short-run productivity decline of 1.33 percentage points, with older firms seeing the deepest dips before pulling ahead. McElheran's summary in MIT Sloan is the line to keep:
"That initial dip — the downward slope of the J-curve — is very real."
METR's −19% lands around week 6 for experienced developers — the measurable trough. The right-arm magnitude is organisation-specific; the shape is not. Geopolitics, macro, manufacturing — three independent literatures, one curve. So what makes week 6 specifically so dangerous?
What Week 6 Actually Feels Like
The marathon runner's wall hits at mile 20 of 26 — the worst possible mile to quit at, past the halfway commitment, not yet close enough to see the finish. Trained runners know it's coming. Untrained runners read it as failure. Week 6 of an AI pilot is mile 20.
The developers feel faster. Cognitive load is genuinely lower — the tool drafts the boilerplate, suggests the test, summarises the diff. The screen recordings tell a different story: more idle time, more verification overhead, more deletion. The longitudinal version lands at ICSE'26, where Sergeyuk and colleagues tracked 800 developers over two years:
"Telemetry reveals that AI users produce substantially more code but also delete significantly more. Meanwhile, survey respondents report productivity gains..."
The perception gap is not a METR artefact; it persists at two-year scale. And the throughput-without-quality signature is exactly what Faros AI's 2026 telemetry across 22,000 developers shows: epics per developer up 66%, incidents per PR up 242%, median PR review time up 441%. More output, more rework. Selenium's Jason Baum gave it the four-word version on the Coder podcast:
"We're running before we walk with AI."
At week 6 the three signals are in open conflict. Self-report is positive, throughput is flat, verification overhead is rising. Which signal does leadership trust?
I've watched this gap on my own work, at a fraction of the scale. I run four or five Claude Code agents across my projects and I measure them obsessively — cost, elapsed time, lines changed — because after twenty years of building I trust a number over a feeling. For a stretch, the feeling said I was flying. The bug queue said otherwise: the rework I'd waved through was quietly catching up with me. I plan my own throughput at something close to 1.3× now, and even that took a few sprints of watching the dip close before I believed the right arm was real. If a solo developer who measures everything can misread his own curve, a team running on sentiment doesn't stand a chance.
The Same Shape, Every Time
The right arm of the J is not theoretical — it shows up at every documented scale.
Google's DORA 2026 ROI of AI-Assisted Software Development report bakes the dip into its public calculator as two parameters: j_curve_drop=0.15 and j_curve_duration=3. A 15% drop for three months, applied honestly, still yields a 39% first-year ROI and an 8-month payback. The dip is a line item. DORA's verdict on what goes wrong is the sentence to memorise:
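To make "the dip is a line item" concrete, here is a minimal sketch of a first-year model that takes the dip as an input. Only the two parameter names and their values (j_curve_drop=0.15, j_curve_duration=3) come from the DORA calculator as quoted above. The function, the monthly value and cost figures, the steady-state gain, and the definition of ROI as net gain over total cost are all assumptions of mine; this is not DORA's formula and it will not reproduce their 39% figure.

```python
from typing import Optional, Tuple


def first_year_roi(
    monthly_value: float,        # hypothetical: baseline value the team delivers per month
    monthly_cost: float,         # hypothetical: licences plus enablement per month
    steady_state_gain: float,    # hypothetical: fractional uplift once the dip has passed
    j_curve_drop: float = 0.15,  # from the article: 15% drop
    j_curve_duration: int = 3,   # from the article: three months
    months: int = 12,
) -> Tuple[float, Optional[int]]:
    """Return (ROI, payback month) for a simple month-by-month J-curve model."""
    cumulative = 0.0
    payback_month = None
    for month in range(1, months + 1):
        # During the dip the team delivers less; afterwards it delivers more.
        change = -j_curve_drop if month <= j_curve_duration else steady_state_gain
        cumulative += monthly_value * change - monthly_cost
        if payback_month is None and cumulative >= 0:
            payback_month = month
    # Assumed definition: net gain over the period divided by total spend.
    roi = cumulative / (monthly_cost * months)
    return roi, payback_month


# Two runs with made-up inputs that differ only in the assumed right arm.
for gain in (0.20, 0.10):
    roi, payback = first_year_roi(100_000, 4_000, steady_state_gain=gain)
    print(f"gain={gain:.0%} roi={roi:.2f} payback_month={payback}")
```

With these made-up inputs a 20% steady-state gain pays back in month 7, while a 10% gain does not pay back inside the first year. The dip is identical in both runs; the verdict depends entirely on the right arm, which is the part nobody can see at week 6.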
"Initiatives often fail not because the technology is flawed but because leadership misinterprets this learning phase as a failure and pulls funding during the inevitable dip."
The case studies say the same thing at different time constants. Box paired power users one hour a week with newer adopters and in six weeks lifted Cursor usage by 75% toward 85%+ daily adoption across 800+ engineers. Spotify started investigating background coding agents in February 2024; twenty-one months later Honk had merged more than 1,500 agent-created PRs into production. Goldman Sachs spent months of intensive preparation before fleet-wide Devin deployment, targeting a 3–4× step-change per CIO Marco Argenti.
Different scales, same shape. Every one rode through a measurable dip, and the ones who bought the J-curve as a line item came out the other side. The strongest evidence the right arm exists may be METR's own February 2026 follow-up: the slowdown narrowed to −4% with the confidence interval crossing into positive territory, and 30–50% of developers refused tasks that required working without AI. Selection bias makes the number unreliable. The refusal is the story.
But knowing the dip exists doesn't help if you're measuring the wrong things at the bottom of it.
What to Measure During the Dip
If you are a CTO with a pilot in week 5, here are the lines to put on next week's dashboard — and the ones to take off.
Stop watching: PR count, story points, lines of code, raw velocity. TechEmpower's Tony Karrer, from direct rollout observation: "Let adoption mature... your measure should focus on adoption and use to enable coaching."
Start watching, six signals:
- Adoption — at least 30–40% of provisioned engineers in weekly use by week 8.
- Diffusion — top 10% of users contributing no more than 50% of usage. Above that, you have a pet project, not a rollout.
- Cycle time, segmented — AI-touched PRs within ±15% of non-AI is healthy at week 8. Improvements are bonus.
- Change failure rate, segmented — AI-touched CFR within ±2–3 points of non-AI. If the gap is wider than that in a sensitive domain, narrow the surface area there first.
- Verifier catch rate — first-pass review rate trending up. The leading indicator of quality before throughput moves, and the instrumentation for the Verify step I covered in The 5-Step Loop.
- Cost per developer per day — alert at 2× baseline. Runaway spend is structurally different from the cost of learning.
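The six signals above can be checked mechanically once the data is collected. The following is a rough sketch, not a finished dashboard: the PilotSnapshot fields and function names are hypothetical, the thresholds use the looser end of each range in the list (30% adoption, 3 points of CFR), and "trending up" is reduced to the crude test of the latest week beating the first.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PilotSnapshot:
    """Hypothetical container for one week-8 reading of the pilot."""
    provisioned: int                      # engineers with a licence
    weekly_active: int                    # engineers who used the tool this week
    usage_by_user: List[float]            # any usage unit, one entry per user
    cycle_time_ai: float                  # median cycle time, AI-touched PRs
    cycle_time_non_ai: float              # median cycle time, other PRs
    cfr_ai: float                         # change failure rate in percent
    cfr_non_ai: float
    first_pass_rate_by_week: List[float]  # oldest first
    cost_per_dev_day: float
    baseline_cost_per_dev_day: float


def top_decile_share(usage: List[float]) -> float:
    """Share of total usage contributed by the top 10% of users."""
    ranked = sorted(usage, reverse=True)
    top_n = max(1, (len(ranked) + 9) // 10)  # ceiling of 10% of users
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0


def check_signals(s: PilotSnapshot) -> Dict[str, bool]:
    """True means the signal is inside the healthy band described above."""
    return {
        "adoption": s.weekly_active / s.provisioned >= 0.30,
        "diffusion": top_decile_share(s.usage_by_user) <= 0.50,
        "cycle_time": abs(s.cycle_time_ai / s.cycle_time_non_ai - 1) <= 0.15,
        "change_failure_rate": abs(s.cfr_ai - s.cfr_non_ai) <= 3.0,
        "verifier_catch_rate": s.first_pass_rate_by_week[-1] > s.first_pass_rate_by_week[0],
        "cost": s.cost_per_dev_day < 2 * s.baseline_cost_per_dev_day,
    }


# Made-up example: healthy on five signals, cost running above 2x baseline.
snapshot = PilotSnapshot(
    provisioned=40, weekly_active=14,
    usage_by_user=[50, 30] + [10] * 18,
    cycle_time_ai=2.2, cycle_time_non_ai=2.0,
    cfr_ai=9.0, cfr_non_ai=7.5,
    first_pass_rate_by_week=[0.55, 0.58, 0.62],
    cost_per_dev_day=12.0, baseline_cost_per_dev_day=5.0,
)
flags = check_signals(snapshot)
for name, healthy in flags.items():
    print(f"{name}: {'ok' if healthy else 'ALERT'}")
```

Keeping each signal as a separate boolean is deliberate. A single blended score would hide exactly the conflict this post is about.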
There is one genuine cut signal, structurally different from the dip. If post-merge defect rate sits at 2× pre-adoption baseline for four consecutive weeks with no downward trend, pause and debug the harness. That is the stacking-staircase pattern from The Quiet Failure Inside the Agent — compounding, not transient. Cut domain-specific before org-wide.
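The cut rule is simple enough to write down as a predicate. This is one possible reading of it, with assumptions flagged: the function name is hypothetical, rates are weekly post-merge defect rates in whatever unit you already track, and "no downward trend" is interpreted as the last week of the window being no lower than its first.

```python
from typing import Sequence


def should_pause(
    weekly_defect_rates: Sequence[float],  # oldest first
    baseline: float,                       # pre-adoption defect rate
    multiple: float = 2.0,                 # from the article: 2x baseline
    weeks: int = 4,                        # from the article: four consecutive weeks
) -> bool:
    """True when the defect rate has sat at or above multiple x baseline
    for the last `weeks` weeks without coming down."""
    if len(weekly_defect_rates) < weeks:
        return False  # not enough history to call it
    window = list(weekly_defect_rates[-weeks:])
    sustained = all(rate >= multiple * baseline for rate in window)
    # Assumed trend test: the end of the window is no better than its start.
    no_downward_trend = window[-1] >= window[0]
    return sustained and no_downward_trend


# Made-up series against a 2% baseline.
print(should_pause([0.02, 0.05, 0.05, 0.06, 0.06], baseline=0.02))  # stuck high
print(should_pause([0.09, 0.08, 0.07, 0.05], baseline=0.02))        # high but falling
print(should_pause([0.05, 0.03, 0.05, 0.05], baseline=0.02))        # one week under 2x
```

Only the first series trips the rule. The second is still elevated but improving, which is what a dip looks like; the first is what compounding looks like.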
All of which assumes the conversation about the dip has already happened. If it hasn't, the dashboard will not save you.
Tell Your CFO Before You Sign the Contract
The most leveraged tactic in this entire post is pre-commitment: budget the J-curve as a line item in the first-year cost model before the contract is signed. Loss aversion is why this matters. Kahneman and Tversky's work on prospect theory is one long demonstration that humans process surprise losses far more harshly than expected ones. A 15% drop modelled as a 15% drop is on plan; the same drop unmodelled is a crisis. Identical dashboard, opposite reaction.
Next week I'll cover what the right arm actually settles at once you net out downstream incident cost. Vendors will tell you 5–10×. Independent measurement says 1.2–1.5× net once you count the incidents the agents shipped. Two earlier pieces sit alongside this one: The Governance Wall named the five primitives every rollout needs before deployment, and The 5-Step Loop named the verification step the dashboard above depends on.
The dip is survivable. But if your CFO doesn't know about it, the dashboard is academic. Tell them before the contract is signed.
Series Navigation
- Series opener: The Governance Wall — Why AI Agents Stall Before Production
- Part 1: The 5-Step Loop — Why Your Agent Fails at Step 4
- Part 2: The Productivity J-Curve — Why Your AI Pilot Looks Worst at Week 6 (you are here)
- Part 3: The Honest Productivity Number (coming soon)