1.2× Not 10×: The Honest Productivity Number Nobody's Publishing
Same Vendor. Same Product. 55% Became 8.69%.
GitHub said 55%. Then GitHub ran the rigorous version and got 8.69%. Same company, same product, the headline number quietly shed six-sevenths of its weight — and almost nobody updated their slides.
I measure my own AI productivity, because I run four or five Claude Code agents across my projects and I want to know what I'm actually buying. The number I land on is nowhere near 10×, and after twenty years of shipping production software I've learned to distrust any uplift figure that doesn't say what it's net of. So when I see the same vendor publish two numbers more than six times apart, I want to know which one to plan a team around.
Here's how the gap opened. In September 2022, GitHub published the study that defined the decade's AI productivity narrative. Ninety-five developers. One JavaScript task. A lab. The headline: Copilot users finished 55% faster than the control group. That number is on every slide deck in every boardroom I've seen this year.
In May 2024, the same vendor published a second study — a randomised controlled trial inside Accenture, 450 developers, real codebases, real review cycles, real production. The headline was different: 8.69% more PRs per developer. Fifteen percent better merge rate. Eighty-four percent more successful builds. Same vendor, same product, two years apart, and the 55% had compressed to single digits.
Both numbers are real. They measure different things. The 95% confidence interval on the 55% figure runs from 21% all the way to 89%, a fact almost nobody quotes alongside it. Cursor's November 2025 paper added a third data point: 39% more PRs under agent mode, with legitimate quasi-experimental methodology. But the paper itself notes that the revert rate didn't change and the bugfix rate slightly decreased. The throughput is real. The quality story isn't there.
So which one do you plan capacity on? That is the entire question. And the honest answer sits between them — closer to the lower end than any vendor wants to admit.
Gross Is Real. Net Is What Ships.
There is a frame for this that every CFO already knows. Revenue is a gross number. Nobody runs a business on revenue. You strip out cost of goods, operating expense, tax — and what survives is net income. A board that confused the two would not last a quarter.
AI productivity claims work the same way. PRs merged, tasks completed, tokens processed, lines generated — these are gross numbers. They are real. They are also before costs. The costs are downstream: incident remediation, security debt, senior-engineer review time, code that gets written and then deleted before it survives a fortnight in production. The vendor publishes gross. Nobody publishes net.
Gartner has been telling CFOs this for over a year. Randeep Rathindran, Distinguished Vice President, Research, in the Gartner Finance practice, was direct in March 2025:
"Despite the excitement surrounding AI, its impact on productivity has been inconsistent, leading to what some describe as the AI productivity paradox… CFOs should recalibrate expectations on how AI will truly impact worker productivity and headcount."
Recalibrate by how much? If gross is everywhere and net is nowhere, the obvious question is the size of the deductions. The largest available telemetry dataset has a brutal answer.
The Subtractions Faros Measured
In April 2026, Faros AI published the Acceleration Whiplash report — two years of longitudinal SDLC telemetry from 22,000 developers across 4,000+ enterprise teams. Not a survey. Raw data from version control, CI/CD pipelines, and incident management systems. They compared each organisation's lowest-AI-adoption period against its highest. The findings have a particular shape.
The gross side first:
| Metric | Change |
|---|---|
| Epics completed per developer | +66.2% |
| Task throughput per developer | +33.7% |
| PR merge rate per developer | +16.2% |
Genuine productivity. Now the subtractions:
| Metric | Change |
|---|---|
| Incidents per PR | +242.7% |
| Median time in code review | +441.5% |
| Code churn ratio | +861% |
| PRs merged with zero review | +31.3% |
| Bugs per developer | +54% (up from +9% in 2025) |
And the line that should stop the meeting:
Deployment frequency is down 11.7% even as PR merge rate is up 16.2%.
More code is being written. Less is shipping safely. The constraint has moved from production to qualification: review, testing, staging, the unglamorous work that makes code real. One precision note: 242.7% is incidents per PR, not per deployment. The ratio matters because AI inflates the denominator: more PRs are being merged, so a per-PR rate that still more than triples means incidents are growing faster than PR volume. Either way it is unwelcome.
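To see what the per-PR ratio implies per engineer, the two Faros figures can be multiplied together. This is a rough sketch, not a figure from the report: it assumes both metrics count the same merged PRs, so that incidents per developer equals incidents per PR times PRs per developer. The function name is illustrative.

```python
# Faros figures quoted above, expressed as multipliers on the
# lowest-AI-adoption baseline (e.g. +242.7% -> 3.427x).
INCIDENTS_PER_PR = 1 + 242.7 / 100
PRS_PER_DEVELOPER = 1 + 16.2 / 100


def incidents_per_developer(incidents_per_pr, prs_per_developer):
    """Assumed identity: incidents/dev = (incidents/PR) * (PRs/dev).

    Only holds if both ratios are measured over the same set of PRs,
    which the report excerpt above does not confirm.
    """
    return incidents_per_pr * prs_per_developer


multiplier = incidents_per_developer(INCIDENTS_PER_PR, PRS_PER_DEVELOPER)
print(f"Incidents per developer: {multiplier:.2f}x baseline")
```

Under that assumption the incident load landing on each developer is close to four times the baseline, which is the number a remediation reserve has to absorb.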
Faros names the asterisk themselves:
"Throughput measures what was shipped, not what survived. The 861% is the asterisk on every output number in this report."
These are operational numbers. They become financial numbers when you put them through a planning model. In April, DORA published one. Faros stress-tested it.
When You Plug Real Numbers Into DORA's Calculator
In April 2026, DORA released an interactive ROI of AI calculator alongside their report. Default scenario for a 500-person organisation: +$3.28M first-year benefit, +39.2% ROI, 0.7-year payback. It is the kind of number that closes a board slide cleanly.
DORA themselves call the calculator a "conversation starter." Faros made the conversation honest. They plugged in their telemetry — deployments down, features up, change-failure rate climbing from 5% to 15%, J-curve duration extended from DORA's optimistic 3 months to a telemetry-realistic 12. The combined scenario:
| Scenario | First-year benefit | ROI | Payback |
|---|---|---|---|
| DORA default | +$3.28M | +39.2% | 0.7 yr |
| J-curve realism only (12 months) | −$6.62M | −36.2% | 1.6 yr |
| Quality realism only (CFR 5→15%) | +$1.27M | +15.1% | 0.9 yr |
| Telemetry-informed combined | −$3.46M | −18.9% | 1.2 yr |
The most load-bearing variable is not throughput gain. It is J-curve duration. Changing only the J-curve from 3 months to 12 months swings the outcome by $9.9M. Nothing else in the model has that lever. Faros's framing on what this means for a CFO:
"A 1.2-year payback is something a CFO can plan around. A 0.7-year payback that turns into 1.6 because the J-curve input was set too short is something that erodes trust in every subsequent forecast. The slippage isn't in the math. It's in the inputs the user accepted without testing."
DORA's 2026 ROI report makes the same point in different language. Initiatives don't fail because the technology is flawed. They fail because leadership pulls funding during the natural learning-phase dip — exactly the moment the J-curve predicts and the default inputs hide. The calculator works. The defaults were optimistic. If you are a CTO presenting AI ROI to your CFO this quarter, the most dangerous assumption on the slide isn't throughput. It's the J-curve duration.
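The scenario table's arithmetic can be checked directly. The sketch below copies the first-year benefit figures from the table (it does not reproduce the DORA calculator itself) and ranks each scenario by its distance from the default; `Scenario` and `swings_vs_default` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    benefit_musd: float   # first-year benefit in $M, copied from the table
    roi_pct: float
    payback_years: float


SCENARIOS = [
    Scenario("DORA default", 3.28, 39.2, 0.7),
    Scenario("J-curve realism only (12 months)", -6.62, -36.2, 1.6),
    Scenario("Quality realism only (CFR 5->15%)", 1.27, 15.1, 0.9),
    Scenario("Telemetry-informed combined", -3.46, -18.9, 1.2),
]


def swings_vs_default(scenarios):
    """How far each scenario's first-year benefit falls below the first row."""
    base = scenarios[0]
    return {
        s.name: round(base.benefit_musd - s.benefit_musd, 2)
        for s in scenarios[1:]
    }


swings = swings_vs_default(SCENARIOS)
# Largest swing first: the input that moves the forecast the most.
for name, swing in sorted(swings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: ${swing:.2f}M below default")
```

Sorting by swing puts the J-curve row at the top, which is the $9.9M lever the section describes; the quality-only change moves the outcome by roughly a fifth of that.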
Plan at 1.3×. Hold a 20% Reserve. Don't Reach for More Agents.
So if 1.2–1.5× is the honest number and the J-curve is the load-bearing variable, what does that look like in a sprint plan? Five rules I've been using with the teams I work with:
- Plan capacity at 1.2–1.5×, not 2× or 5×. Use the lower end (1.2×) for the first 4–6 sprints. Raise to 1.3–1.4× only when change-failure rate and deployment frequency are flat or improving for three consecutive sprints.
- Hold 15–25% of sprint capacity as a remediation reserve. This is the slot for the 242.7% incident inflation. If you don't book it, your seniors burn weekends paying it.
- Set a PR-noise budget: ≤2× pre-adoption defect rate per deployment. Exceeded for four consecutive weeks? Pause the rollout, debug the harness, do not add more agents.
- Measure delivery, not throughput. Incident-free deployments per week and CFR are the planning unit. Not PR count. Not velocity points. Not vibes from standup.
- When delivery slips, do not reach for more capacity. This is the anti-pattern Faros's data makes visible. The bottleneck under high AI adoption is human review, not code generation. Adding more agent capacity widens the queue without clearing it. Simon Willison's "shifted bottleneck" framing, which I've used in this series before, is the diagnosis. More agents do not clear the gate. They flood it.
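The first three rules are mechanical enough to sketch as code. What follows is one possible reading, not a tool the teams above use: every name is hypothetical, the 4-sprint minimum and the 1.3× raised value are picks from within the stated ranges, and the reserve is assumed to come out of the uplifted capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sprint:
    cfr: float               # change-failure rate for the sprint, 0..1
    deploys_per_week: float  # deployment frequency for the sprint


def is_stable(history, streak=3):
    """Rule 1 gate: CFR flat or falling AND deploys flat or rising,
    sprint over sprint, for the last `streak` sprints."""
    if len(history) < streak + 1:
        return False  # not enough data to compare `streak` sprints
    recent = history[-(streak + 1):]
    return all(
        cur.cfr <= prev.cfr and cur.deploys_per_week >= prev.deploys_per_week
        for prev, cur in zip(recent, recent[1:])
    )


def capacity_multiplier(sprints_since_adoption, history, min_sprints=4):
    """Rule 1: 1.2x during the first sprints; 1.3x only once the gate is met."""
    if sprints_since_adoption < min_sprints or not is_stable(history):
        return 1.2
    return 1.3


def planned_feature_capacity(baseline_points, multiplier, reserve=0.20):
    """Rule 2: carve the remediation reserve out of the uplifted capacity
    (an assumption; the rule does not say which capacity it applies to)."""
    return baseline_points * multiplier * (1 - reserve)


def should_pause_rollout(weekly_defect_rates, pre_adoption_rate,
                         budget=2.0, weeks=4):
    """Rule 3: pause when defects per deployment exceed `budget` times the
    pre-adoption rate for `weeks` consecutive weeks (most recent last)."""
    recent = weekly_defect_rates[-weeks:]
    return len(recent) == weeks and all(
        rate > budget * pre_adoption_rate for rate in recent
    )


history = [Sprint(0.12, 8), Sprint(0.11, 8), Sprint(0.11, 9), Sprint(0.10, 9)]
multiplier = capacity_multiplier(sprints_since_adoption=6, history=history)
print(multiplier, planned_feature_capacity(100, multiplier))
print(should_pause_rollout([0.05, 0.05, 0.06, 0.05], pre_adoption_rate=0.02))
```

The gate compares each sprint with its predecessor, so it needs four data points before it can confirm three consecutive stable sprints; until then the multiplier stays at 1.2×.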
Faros's recommendation to leaders is the bridge to the next post in this series:
"Do not rush to change headcount on the basis of first-year throughput numbers. The engineers absorbing the quality gap AI is creating are the ones you'll need most when the gap becomes visible."
Which raises an uncomfortable question. If the honest number is 1.3× and the loudest number is 10×, who carries the gap between them?
Running Before We Walk
The gap between 55% and 8.69% is a commitment trap, not a marketing distortion. Teams quote gross and ship net, and the difference lands on the engineers nobody photographs for the case study.
The cognitive engine for the inflation is well-documented. METR's 2025 RCT — which I unpacked in the J-Curve post — found that "developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%." Telemetry said they got slower. Self-report fuels the inflation. Every vendor satisfaction survey is downstream of that gap.
The structural ceiling is just as visible. Anthropic's own 2026 Agentic Coding Trends Report puts engineers' AI usage at 60% of their work, with full delegation at 0–20%. The remaining 40 to 80 percentage points are supervision, and supervision does not compress. The 10× number assumes those points compress to zero. They do not.
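One way to make that ceiling concrete is to borrow Amdahl's law, which the report does not invoke: overall speedup is capped by the share of work that is not accelerated. The sketch below treats the delegation and usage percentages as the accelerated fraction; the 2× speedup on AI-assisted work is a purely hypothetical input.

```python
import math


def overall_speedup(accelerated_fraction, local_speedup):
    """Amdahl's law: 1 / ((1 - p) + p / s).

    p = share of work that gets faster, s = how much faster it gets.
    Pass math.inf for s to model work that costs no human time at all.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / local_speedup)


# Upper bound if the fully delegated 20% took zero human time.
full_delegation_cap = overall_speedup(0.20, math.inf)

# 60% of work AI-assisted at a hypothetical 2x; the rest unchanged.
assisted = overall_speedup(0.60, 2.0)

# Even with infinite local speedup, 10x overall needs 90% of the work
# accelerated.
ten_x = overall_speedup(0.90, math.inf)

print(f"{full_delegation_cap:.2f}x, {assisted:.2f}x, {ten_x:.1f}x")
```

Under these assumptions both realistic cases land inside the 1.2–1.5× band, and 10× only appears when nine-tenths of the work needs no supervision.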
The honest move is to bank the gains as buffer, not feature scope. Use the uplift to reduce on-call burden, pay down debt, train juniors, harden CI/CD. This is the running-before-we-walk problem in one line: vendor claims sprint, telemetry walks, and capacity plans that commit at sprint pace discover walking pace months later — when trust is the thing that gets spent.
Honest Numbers, Real Decisions
I plan my own work at something close to 1.3×, and even that took a few sprints of watching the bug queue catch up with me before I trusted it. The 10× figure isn't a lie so much as a measurement of the demo — what the model can do in the clean case, before review, before staging, before the incident three weeks later that nobody photographs for the case study. The 1.2–1.5× figure is what survives all of that. It's the number your sprint actually runs on, whether or not it's the number on the board slide.
The press release reports gross. Your capacity plan has to be written in net, because net is the only part anyone has to live inside. Quote the gross figure to your CFO and you're not forecasting; you're pre-committing the difference to a bug queue four sprints out, when the J-curve is at its worst and the board's patience is at its thinnest.
That difference doesn't evaporate. It gets paid, in someone's hours. The next post is about whose.
Series Navigation
- Post 1: The Governance Wall — Why Most AI Agents Can't Reach Production
- Post 2: The 5-Step Loop — Why Your Agent Fails at Step 4
- Post 3: The Productivity J-Curve — Why Week 6 of an AI Pilot Always Hurts
- Post 4: 1.2× Not 10× — The Honest Productivity Number Nobody's Publishing (you are here)
- Post 5: Protect the Juniors — Cognitive Debt and the Stack Overflow Collapse