The Productivity J-Curve: Why Your AI Pilot Looks Worst at Week 6
The 39-Point Reversal
Sixteen experienced developers. Two hundred and forty-six real tasks. Mature open-source repositories they had been contributing to for years. Cursor Pro, Claude 3.5 and 3.7 Sonnet — frontier models at the time of the study. Randomised between AI-allowed and AI-forbidden. Run February to June 2025. Published 10 July as arXiv:2507.09089 by Becker, Rush, Barnes and Rein at METR.
The headline finding is the cleanest sign reversal in the field:
"After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19% — AI tooling slowed developers down."
Plus twenty perceived. Minus nineteen measured. The confidence interval was +2% to +39%: a real slowdown, just not a precise one. Thirty-nine percentage points between what the clock said and what the developers said. Both expert forecasters and the developers themselves had predicted a speed-up of 24% or more. They got the sign wrong.
That is not a tooling story. It is a measurement story. The instant week 6 of a pilot lands on a CTO's desk with positive developer sentiment and flat delivery metrics, the gap becomes visible to management. The tool has not failed. The measurement window has.
This is not a quirk of one study. It is the shape of every general-purpose technology, and it has a name.
The Shape Has a Name
In 2006, the political scientist Ian Bremmer published The J Curve to explain why authoritarian states become more unstable, not less, when they begin to open up. Stability falls before it rises. The trough is the dangerous part. The right-arm peak is higher than the starting point, but only if you survive the bottom.
Twelve years later, the economists Erik Brynjolfsson, Daniel Rock and Chad Syverson lifted the same shape into macroeconomics. Their 2018 Productivity J-Curve paper showed that general-purpose technologies — electricity, the personal computer, AI — require large intangible investments in skills, processes and data redesign before the productivity gains show up in the statistics. Measured output falls. Then it rises higher than before. Same shape, longer time constant.
Then, in July 2025, Kristina McElheran and co-authors put numbers on it for AI specifically. US Census Bureau data across tens of thousands of manufacturing firms. AI adoption produced an average short-run productivity decline of 1.33 percentage points. Older firms saw the deepest dips. Then they pulled ahead. McElheran's summary in MIT Sloan is the line to keep:
"That initial dip — the downward slope of the J-curve — is very real."
The trough is the measurable bit. METR's −19% lands somewhere around week 6 for experienced developers. The right-arm magnitude is what Brynjolfsson, DORA and Faros's telemetry argue is reachable if you survive the bottom. The exact numbers are organisation-specific. The shape is not.
Geopolitics, macro, manufacturing. Three independent literatures, one curve. So if the shape is settled, what makes week 6 specifically so dangerous?
What Week 6 Actually Feels Like
The marathon runner's wall hits at roughly mile 20 of 26. The body has burned through glycogen and switched to fat metabolism. The legs are heavier. The pace drops. It is the worst possible mile to quit at, because the runner is past the halfway commitment and not yet close enough to see the finish. Trained runners know the wall is coming. Untrained runners interpret it as failure.
Week 6 of an AI pilot is mile 20.
The developers feel faster. Cognitive load is genuinely lower — the tool drafts the boilerplate, suggests the test, summarises the diff. The screen recordings tell a different story. More idle time. More verification overhead. More deletion. The longitudinal version of the same finding lands at ICSE'26. Sergeyuk and colleagues tracked 800 developers over two years and found:
"Telemetry reveals that AI users produce substantially more code but also delete significantly more. Meanwhile, survey respondents report productivity gains..."
The perception gap is not a METR artefact. It persists at two-year scale. And the throughput-without-quality signature is exactly what Faros AI's 2026 telemetry across 22,000 developers shows: epics per developer up 66%, incidents per PR up 242%, median PR review time up 441%. More output. More rework. Selenium's Jason Baum gave it the four-word version on the Coder podcast in October 2025:
"We're running before we walk with AI."
At week 6 the three signals are in open conflict. Self-report is positive. Throughput is flat. Verification overhead is rising. Which signal does leadership trust?
The Same Shape, Every Time
The right arm of the J is not theoretical. It shows up at every documented scale.
Google's DORA 2026 ROI of AI-Assisted Software Development report bakes the dip into its public calculator as a parameter — j_curve_drop=0.15, j_curve_duration=3. Fifteen percent productivity drop across the first three months. That single assumption, applied honestly, still yields a 39% first-year ROI and an 8-month payback in DORA's model. The dip is a line item. The DORA team's verdict on what actually goes wrong is the sentence to memorise:
"Initiatives often fail not because the technology is flawed but because leadership misinterprets this learning phase as a failure and pulls funding during the inevitable dip."
The case studies say the same thing in different time constants. Box ran a structured peer-mentorship programme — power users paired one hour a week with newer adopters — and in six weeks lifted Cursor usage by 75% and power users by 25% on the path to more than 85% daily adoption across 800+ engineers. Spotify began investigating background coding agents in February 2024. By November 2025, Honk had merged more than 1,500 agent-created PRs into production. Twenty-one months. Three engineering blog posts on context engineering and feedback-loop reliability published while the system was scaling. Goldman Sachs invested months of intensive preparation before fleet-wide Devin deployment, targeting a 3–4× productivity step-change according to CIO Marco Argenti.
Six weeks. Twenty-one months. Months of embedded preparation. Different scales, same shape. Every one of these organisations rode through a measurable dip. The ones who bought the J-curve as a line item came out the other side. The strongest evidence the right arm exists may be METR's own February 2026 follow-up: the slowdown narrowed to −4% with the confidence interval now crossing into positive territory, and 30–50% of developers refused to submit tasks that required working without AI. The selection bias makes the number unreliable. The refusal is the story.
But knowing the dip exists doesn't help if you're measuring the wrong things at the bottom of it.
What to Measure During the Dip
If you are a CTO reading this with a pilot in week 5, here are the lines to put on next week's dashboard — and the ones to take off.
Stop watching: PR count, story points, lines of code, raw velocity. TechEmpower's Tony Karrer, writing from direct rollout observation, puts it this way: "Let adoption mature... your measure should focus on adoption and use to enable coaching."
Start watching, six signals:
- Adoption — at least 30–40% of provisioned engineers in weekly use by week 8.
- Diffusion — top 10% of users contributing no more than 50% of usage. Above that, you have a pet project, not a rollout.
- Cycle time, segmented — AI-touched PRs within ±15% of non-AI on cycle time is healthy at week 8. Improvements are bonus, not requirement.
- Change failure rate, segmented — AI-touched CFR within ±2–3 percentage points of non-AI. If the gap is wider than that in sensitive domains, narrow the surface area first.
- Verifier catch rate — first-pass review rate trending up. This is the leading indicator of quality before throughput moves, and the instrumentation for the Verify step I covered in The 5-Step Loop.
- Cost per developer per day — alert at 2× baseline. Runaway is structurally different from learning.
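The six signals above can be sketched as a weekly health check. This is a minimal illustration, not a reference implementation: the `WeekSnapshot` fields, the usage unit and the function names are all hypothetical, the lower or upper bound of each quoted range is used as the threshold, and "trending up" is simplified to "higher than the previous reading".

```python
from dataclasses import dataclass


@dataclass
class WeekSnapshot:
    """One week of pilot telemetry. All field names are hypothetical."""
    weekly_active: int            # engineers who used the tool this week
    provisioned: int              # engineers with a licence
    usage_by_engineer: list       # usage units per engineer (any consistent unit)
    cycle_time_ai: float          # median cycle time of AI-touched PRs
    cycle_time_non_ai: float      # median cycle time of all other PRs
    cfr_ai: float                 # change failure rate as a fraction, AI-touched
    cfr_non_ai: float             # change failure rate as a fraction, non-AI
    first_pass_rate: float        # first-pass review rate this period
    first_pass_rate_prev: float   # first-pass review rate in the prior period
    cost_per_dev_day: float
    cost_baseline: float


def top_decile_share(usage):
    """Share of total usage contributed by the top 10% of users."""
    ranked = sorted(usage, reverse=True)
    top_n = max(1, len(ranked) // 10)
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0


def evaluate(s):
    """Return a dict of signal name -> True when the signal looks healthy."""
    return {
        "adoption": s.weekly_active / s.provisioned >= 0.30,
        "diffusion": top_decile_share(s.usage_by_engineer) <= 0.50,
        "cycle_time": abs(s.cycle_time_ai / s.cycle_time_non_ai - 1) <= 0.15,
        "change_failure_rate": abs(s.cfr_ai - s.cfr_non_ai) <= 0.03,
        "verifier_catch_rate": s.first_pass_rate > s.first_pass_rate_prev,
        "cost": s.cost_per_dev_day <= 2 * s.cost_baseline,
    }


# Example with invented numbers for a 50-seat pilot.
snapshot = WeekSnapshot(
    weekly_active=18, provisioned=50,
    usage_by_engineer=[40, 30] + [5] * 18,
    cycle_time_ai=22.0, cycle_time_non_ai=20.0,
    cfr_ai=0.08, cfr_non_ai=0.06,
    first_pass_rate=0.62, first_pass_rate_prev=0.55,
    cost_per_dev_day=9.0, cost_baseline=5.0,
)
report = evaluate(snapshot)
```

A single failing signal is a coaching prompt rather than a verdict; the thresholds are the ones listed above and should be tuned to the organisation.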
There is one genuine cut signal, and it is structurally different from the dip. If post-merge defect rate sits at 2× pre-adoption baseline for four consecutive weeks with no downward trend, pause the expansion and debug the harness. That is the stacking-staircase pattern I touched on in The Quiet Failure Inside the Agent — a compounding signal, not a transient one. Cut domain-specific before org-wide.
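The cut rule in the previous paragraph can be written down as a small check. This is a sketch under stated assumptions: the function name and parameters are hypothetical, and "no downward trend" is interpreted simply as the latest week not being lower than the first week of the window. A stricter trend test, such as a fitted slope, would be a reasonable substitute.

```python
def should_pause(weekly_defect_rates, baseline, window=4, multiple=2.0):
    """Return True when the post-merge defect rate has stayed at or above
    `multiple` x baseline for `window` consecutive weeks without improving.

    weekly_defect_rates: oldest-first list of weekly post-merge defect rates.
    baseline: pre-adoption defect rate, in the same unit.
    """
    if len(weekly_defect_rates) < window:
        return False  # not enough history to call it sustained
    recent = weekly_defect_rates[-window:]
    sustained = all(rate >= multiple * baseline for rate in recent)
    # Assumption: "improving" means the latest week is below the first week
    # of the window. Flat or rising counts as no downward trend.
    improving = recent[-1] < recent[0]
    return sustained and not improving
```

Running it per domain before running it org-wide matches the "cut domain-specific before org-wide" advice: a pause in one sensitive service does not have to stop the whole rollout.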
All of which assumes the conversation about the dip has already happened. If it hasn't, the dashboard will not save you.
Tell Your CFO Before You Sign the Contract
The most leveraged tactic in this entire post is pre-commitment. Budget the J-curve as a line item in the first-year cost model before the contract is signed. The DORA calculator default — 15% drop for three months — yields the 39% first-year ROI when modelled honestly. Plug your own numbers in. The point is not the precision. The point is that finance and leadership have agreed, on the record, that the dip is the plan.
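To make "plug your own numbers in" concrete, here is a minimal sketch of a first-year model with the dip as an explicit parameter. Only `j_curve_drop` and `j_curve_duration` come from the DORA defaults quoted earlier; every other field (steady-state gain, value per developer-month, tool cost, team size) is a hypothetical placeholder. This is not DORA's calculator and does not reproduce its 39% figure.

```python
from dataclasses import dataclass


@dataclass
class PilotModel:
    j_curve_drop: float = 0.15               # DORA calculator default quoted above
    j_curve_duration: int = 3                # months, also the quoted default
    steady_state_gain: float = 0.10          # hypothetical lift after the dip
    monthly_value_per_dev: float = 15_000.0  # hypothetical value of one dev-month
    monthly_cost_per_dev: float = 300.0      # hypothetical licences, tokens, training
    developers: int = 50                     # hypothetical team size


def monthly_net(model, month):
    """Net value for one month (1-indexed): productivity delta minus tool cost."""
    in_dip = month <= model.j_curve_duration
    delta = -model.j_curve_drop if in_dip else model.steady_state_gain
    value = delta * model.monthly_value_per_dev * model.developers
    cost = model.monthly_cost_per_dev * model.developers
    return value - cost


def first_year(model):
    """Return (cumulative net value per month, payback month or None)."""
    cumulative, running, payback = [], 0.0, None
    for month in range(1, 13):
        running += monthly_net(model, month)
        cumulative.append(running)
        if payback is None and running >= 0:
            payback = month
    return cumulative, payback


cumulative, payback = first_year(PilotModel())
trough_month = cumulative.index(min(cumulative)) + 1
```

With these placeholder inputs the cumulative line bottoms out in month 3 and crosses zero in month 10. Those figures illustrate the shape only. The useful output is the trough itself: the depth and month of the worst cumulative position is the number finance should see and sign off before the contract does.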
Loss aversion is the reason this matters. Kahneman and Tversky's body of work on prospect theory is one long demonstration that humans process surprise losses far more harshly than expected ones. A 15% measured drop that was modelled as a 15% expected drop is on plan. The same 15% drop unmodelled is a crisis. The dashboard is identical. The reaction is not. The DORA finding bears restating because it is the entire failure mode in one sentence: leadership misreads the learning phase and pulls funding during the inevitable dip. Not because the technology broke.
Next week I'll cover what the right arm of the J actually settles at, once you net out downstream incident cost. Vendors will tell you 5–10×. Independent measurement says 1.2–1.5× net once you count the incidents the agents shipped. The shape is real. The peak is lower than the brochure.
Until then, two earlier pieces in this series sit alongside this one. The Governance Wall named the five primitives every agent rollout needs before deployment. The 5-Step Loop named the verification step that has to work for the dashboard above to mean anything. The J-curve is the shape the rollout makes around them.
The dip is real. The dip is also survivable. If your CFO doesn't know about the dip, the dashboard is academic. Tell them before the contract is signed.
Series Navigation
- Series opener: The Governance Wall — Why AI Agents Stall Before Production
- Part 1: The 5-Step Loop — Why Your Agent Fails at Step 4
- Part 2: The Productivity J-Curve — Why Your AI Pilot Looks Worst at Week 6 (you are here)
- Part 3: The Honest Productivity Number (coming soon)