1.2× Not 10×: The Honest Productivity Number Nobody's Publishing
Same Vendor. Same Product. 55% Became 8.69%.
GitHub said 55%. Then GitHub ran the rigorous version and got 8.69%. Same company, same product, the headline number quietly shed six-sevenths of its weight — and almost nobody updated their slides.
I measure my own AI productivity, because I run four or five Claude Code agents across my projects and I want to know what I'm actually buying. The number I land on is nowhere near 10×, and after twenty years of shipping production software I've learned to distrust any uplift figure that doesn't say what it's net of.
Here's how the gap opened. In September 2022, GitHub published the study that defined the decade's AI productivity narrative. Ninety-five developers. One JavaScript task. A lab. The headline: Copilot users finished 55% faster than the control group. That number is on every slide deck in every boardroom I've seen this year.
In May 2024, the same vendor published a second study — a randomised controlled trial inside Accenture, 450 developers, real codebases, real review cycles, real production. The headline was different: 8.69% more PRs per developer. Fifteen percent better merge rate. Two years apart, the 55% had compressed to single digits.
Both numbers are real; they measure different things. The 55% is completion speed on a single lab task; the 8.69% is PR throughput in production work. The 95% confidence interval on the 55% figure runs from 21% all the way to 89%, a fact almost nobody quotes alongside it. Cursor's November 2025 paper added a third data point: 39% more PRs under agent mode, with legitimate quasi-experimental methodology. But the same paper notes that the revert rate didn't change and the bugfix rate slightly decreased. The throughput is real. The quality story isn't there.
So which one do you plan capacity on? The honest answer sits between them — closer to the lower end than any vendor wants to admit.
Gross Is Real. Net Is What Ships.
There is a frame every CFO already knows. Revenue is a gross number; nobody runs a business on it. You strip out cost of goods, operating expense, tax — and what survives is net income. A board that confused the two would not last a quarter.
AI productivity claims work the same way. PRs merged, tasks completed, tokens processed, lines generated — gross numbers, real, but before costs. The costs are downstream: incident remediation, security debt, senior-engineer review time, code written and then deleted before it survives a fortnight in production. The vendor publishes gross. Nobody publishes net.
Gartner has been telling CFOs this for over a year. Randeep Rathindran, Distinguished Vice President in the Gartner Finance practice, was direct in March 2025:
"Despite the excitement surrounding AI, its impact on productivity has been inconsistent, leading to what some describe as the AI productivity paradox… CFOs should recalibrate expectations on how AI will truly impact worker productivity and headcount."
Recalibrate by how much? If gross is everywhere and net is nowhere, the question is the size of the deductions. The largest available telemetry dataset has a brutal answer.
The Subtractions Faros Measured
In April 2026, Faros AI published the Acceleration Whiplash report — two years of longitudinal SDLC telemetry from 22,000 developers across 4,000+ enterprise teams. Not a survey. Raw data from version control, CI/CD pipelines, and incident management systems, comparing each organisation's lowest-AI-adoption period against its highest.
The gross side first:
| Metric | Change |
|---|---|
| Epics completed per developer | +66.2% |
| Task throughput per developer | +33.7% |
| PR merge rate per developer | +16.2% |
Genuine productivity. Now the subtractions:
| Metric | Change |
|---|---|
| Incidents per PR | +242.7% |
| Median time in code review | +441.5% |
| Code churn ratio | +861% |
| PRs merged with zero review | +31.3% |
| Bugs per developer | +54% (up from +9% in 2025) |
And the line that should stop the meeting:
Deployment frequency is down 11.7% even as PR merge rate is up 16.2%.
More code is being written. Less is shipping safely. The constraint has moved from production to qualification: review, testing, staging, the unglamorous work that makes code real. (One precision note: the 242.7% is incidents per PR, not per deployment. The distinction matters because AI inflates the PR count, so a per-PR rate understates the growth in total incidents.)
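The precision note about incidents per PR can be made concrete with a little arithmetic. This is a minimal sketch, assuming the three published ratios can simply be multiplied together. Faros does not present them this way, and the assumption only holds if all three were measured over the same population and period. The function name is hypothetical.

```python
# Relative changes taken from the tables above (Faros telemetry).
PR_MERGE_RATE_CHANGE = 0.162     # PRs merged per developer: +16.2%
INCIDENTS_PER_PR_CHANGE = 2.427  # incidents per PR: +242.7%
DEPLOY_FREQ_CHANGE = -0.117      # deployment frequency: -11.7%


def implied_multipliers(pr_change, incidents_per_pr_change, deploy_change):
    """Return (incident-volume multiplier, incidents-per-deployment multiplier).

    ASSUMPTION: the ratios compose multiplicatively. This is an
    illustration of denominator inflation, not a figure from the report.
    """
    pr_mult = 1 + pr_change
    incidents_per_pr_mult = 1 + incidents_per_pr_change
    deploy_mult = 1 + deploy_change

    # More PRs, each carrying more incidents: total incidents compound.
    incident_volume = pr_mult * incidents_per_pr_mult
    # Fewer deployments absorb that larger incident count.
    incidents_per_deploy = incident_volume / deploy_mult
    return incident_volume, incidents_per_deploy


volume, per_deploy = implied_multipliers(
    PR_MERGE_RATE_CHANGE, INCIDENTS_PER_PR_CHANGE, DEPLOY_FREQ_CHANGE
)
print(f"total incidents: {volume:.2f}x, per deployment: {per_deploy:.2f}x")
```

Under that assumption, total incident volume comes out near 4× and incidents per deployment near 4.5×, both worse than the per-PR figure suggests on its own.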
Faros names the asterisk themselves:
"Throughput measures what was shipped, not what survived. The 861% is the asterisk on every output number in this report."
These operational numbers become financial numbers when you put them through a planning model. DORA recently published one. Faros stress-tested it.
When You Plug Real Numbers Into DORA's Calculator
In April 2026, DORA released an interactive ROI of AI calculator alongside their report. Default scenario for a 500-person organisation: +$3.28M first-year benefit, +39.2% ROI, 0.7-year payback. The kind of number that closes a board slide cleanly.
DORA themselves call the calculator a "conversation starter." Faros made the conversation honest by plugging in their telemetry: deployments down, features up, change-failure rate climbing from 5% to 15%, and J-curve duration extended from DORA's optimistic 3 months to a telemetry-realistic 12. The scenarios, first one input at a time and then combined:
| Scenario | First-year benefit | ROI | Payback |
|---|---|---|---|
| DORA default | +$3.28M | +39.2% | 0.7 yr |
| J-curve realism only (12 months) | −$6.62M | −36.2% | 1.6 yr |
| Quality realism only (CFR 5→15%) | +$1.27M | +15.1% | 0.9 yr |
| Telemetry-informed combined | −$3.46M | −18.9% | 1.2 yr |
The most load-bearing variable is not throughput gain. It is J-curve duration. Changing only the J-curve from 3 months to 12 swings the outcome by $9.9M. Nothing else in the model has that lever. Faros's framing on what this means for a CFO:
"A 1.2-year payback is something a CFO can plan around. A 0.7-year payback that turns into 1.6 because the J-curve input was set too short is something that erodes trust in every subsequent forecast. The slippage isn't in the math. It's in the inputs the user accepted without testing."
DORA's 2026 ROI report makes the same point in different language. Initiatives don't fail because the technology is flawed. They fail because leadership pulls funding during the natural learning-phase dip — exactly the moment the J-curve predicts and the default inputs hide. The calculator works; the defaults were optimistic.
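To see which input moves the outcome most, the scenario table can be reduced to deltas against the default. The sketch below only re-does the subtraction on the published rows; it does not reproduce DORA's calculator, and the `Scenario` type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    benefit_musd: float   # first-year benefit, millions of USD
    roi_pct: float
    payback_years: float


# Rows copied from the scenario table above.
SCENARIOS = [
    Scenario("DORA default", 3.28, 39.2, 0.7),
    Scenario("J-curve realism only (12 months)", -6.62, -36.2, 1.6),
    Scenario("Quality realism only (CFR 5->15%)", 1.27, 15.1, 0.9),
    Scenario("Telemetry-informed combined", -3.46, -18.9, 1.2),
]


def swing_vs_default(scenarios):
    """Map each non-default scenario to its first-year benefit lost vs. the default."""
    base = scenarios[0]
    return {
        s.name: round(base.benefit_musd - s.benefit_musd, 2)
        for s in scenarios[1:]
    }


for name, swing in swing_vs_default(SCENARIOS).items():
    print(f"{name}: ${swing:.2f}M below default")
```

The J-curve row alone accounts for a swing of about $9.9M, against roughly $2M for the quality-only row, which is why the J-curve duration is the input worth testing first.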
Plan at 1.3×. Hold a 20% Reserve. Don't Reach for More Agents.
So what does 1.2–1.5× look like in a sprint plan? Five rules I've been using with the teams I work with:
- Plan capacity at 1.2–1.5×, not 2× or 5×. Use 1.2× for the first 4–6 sprints; raise to 1.3–1.4× only when change-failure rate and deployment frequency hold flat or improve for three consecutive sprints.
- Hold 15–25% of sprint capacity as a remediation reserve. This is the slot for the 242.7% incident inflation. Don't book it and your seniors burn weekends paying it.
- Set a PR-noise budget: ≤2× pre-adoption defect rate per deployment. Exceeded for four consecutive weeks? Pause the rollout, debug the harness, do not add more agents.
- Measure delivery, not throughput. Incident-free deployments per week and CFR are the planning unit. Not PR count, not velocity points, not vibes from standup.
- When delivery slips, do not reach for more capacity. The bottleneck under high AI adoption is human review, not code generation. More agent capacity widens the queue without clearing it. This is Simon Willison's "shifted bottleneck," which I've used in this series before. More agents don't clear the gate. They flood it.
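The first three rules are mechanical enough to sketch in code. This is one possible reading, not a prescribed implementation: it assumes the reserve is taken from the uplifted capacity rather than the baseline, and that "hold flat or improve" means each of the last three sprints is no worse than the sprint before it. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SprintStats:
    change_failure_rate: float  # failed deployments / total deployments
    deploys_per_week: float


def plannable_capacity(baseline_points, multiplier=1.2, reserve=0.20):
    """Feature capacity after the AI uplift, minus the remediation reserve.

    ASSUMPTION: the reserve is a share of the uplifted total.
    """
    if not 0.0 <= reserve < 1.0:
        raise ValueError("reserve must be a fraction in [0, 1)")
    return baseline_points * multiplier * (1 - reserve)


def may_raise_multiplier(history):
    """True if CFR and deploy frequency held flat or improved for three sprints running."""
    if len(history) < 4:  # three comparisons need four data points
        return False
    recent = history[-4:]
    return all(
        cur.change_failure_rate <= prev.change_failure_rate
        and cur.deploys_per_week >= prev.deploys_per_week
        for prev, cur in zip(recent, recent[1:])
    )


def should_pause_rollout(weekly_defects_per_deploy, pre_adoption_rate,
                         budget=2.0, weeks=4):
    """True if the defect rate exceeded the noise budget in each of the last `weeks` weeks."""
    if len(weekly_defects_per_deploy) < weeks:
        return False
    limit = budget * pre_adoption_rate
    return all(rate > limit for rate in weekly_defects_per_deploy[-weeks:])


print(plannable_capacity(100))  # baseline of 100 points at 1.2x with a 20% reserve
```

With a 100-point baseline, 1.2× and a 20% reserve leave roughly 96 points for feature work. Under this reading the uplift funds the reserve rather than extra scope.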
Faros's recommendation to leaders is the bridge to the next post:
"Do not rush to change headcount on the basis of first-year throughput numbers. The engineers absorbing the quality gap AI is creating are the ones you'll need most when the gap becomes visible."
Running Before We Walk
The cognitive engine for the inflation is well-documented. METR's 2025 RCT, which I unpacked in the J-Curve post, found that "developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%." The measurement said they were 19% slower. Every vendor satisfaction survey is downstream of that gap.
The structural ceiling is just as visible. Anthropic's own 2026 Agentic Coding Trends Report puts engineers' AI usage at 60% of their work, with full delegation at 0–20%. The remaining 40 to 60 percentage points of AI-assisted work are supervision, and supervision does not compress. The 10× number assumes those points compress to zero. They do not.
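One way to put a number on that ceiling is Amdahl's law. That framing is my substitution, not something the Anthropic report uses. The sketch takes the strongest reading of "supervision does not compress": only fully delegated work gets faster, and everything else takes as long as before. Function names are hypothetical.

```python
import math


def speedup(delegated_fraction, delegated_speedup=math.inf):
    """Amdahl's-law speedup when only the delegated share of work accelerates.

    ASSUMPTION: non-delegated work (including supervision) gets no speedup.
    """
    if not 0.0 <= delegated_fraction <= 1.0:
        raise ValueError("delegated_fraction must be in [0, 1]")
    remaining = 1.0 - delegated_fraction
    denom = remaining + delegated_fraction / delegated_speedup
    return math.inf if denom == 0 else 1.0 / denom


def required_delegation(target_speedup):
    """Share of work that must cost zero time to reach target_speedup."""
    return 1.0 - 1.0 / target_speedup


for fraction in (0.0, 0.1, 0.2):
    print(f"{fraction:.0%} fully delegated -> {speedup(fraction):.2f}x")
print(f"10x needs {required_delegation(10):.0%} of work fully delegated")
```

Even if delegated work were free, 0–20% full delegation caps the gain at about 1.0–1.25×, and 10× would need around 90% of the work delegated with no supervision cost at all.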
I plan my own work at something close to 1.3×, and even that took a few sprints of watching the bug queue catch up with me before I trusted it. The 10× figure isn't a lie so much as a measurement of the demo — what the model can do in the clean case, before review, before staging, before the incident three weeks later that nobody photographs for the case study. The 1.2–1.5× figure is what survives all of that.
So bank the gains as buffer, not feature scope: reduce on-call burden, pay down debt, train juniors, harden CI/CD. Quote the gross figure to your CFO and you're not forecasting; you're pre-committing the difference to a bug queue four sprints out, when the J-curve is at its worst and the board's patience is at its thinnest. That difference doesn't evaporate. It gets paid, in someone's hours. The next post is about whose.
Series Navigation
- Post 1: The Governance Wall — Why Most AI Agents Can't Reach Production
- Post 2: The 5-Step Loop — Why Your Agent Fails at Step 4
- Post 3: The Productivity J-Curve — Why Week 6 of an AI Pilot Always Hurts
- Post 4: 1.2× Not 10× — The Honest Productivity Number Nobody's Publishing (you are here)
- Post 5: Protect the Juniors — Cognitive Debt and the Stack Overflow Collapse