1.2× Not 10×: The Honest Productivity Number Nobody's Publishing
Same Vendor. Same Product. Two Different Numbers.
In September 2022, GitHub published a study that defined the decade's AI productivity narrative. Ninety-five developers. One JavaScript task. A lab. The headline: Copilot users finished 55% faster than the control group. That number is on every slide deck in every boardroom I've seen this year.
In May 2024, the same vendor published a second study. This one was a randomised controlled trial inside Accenture — 450 developers, real codebases, real review cycles, real production. The headline was different: 8.69% more PRs per developer. Fifteen percent better merge rate. Eighty-four percent more successful builds. Same vendor, same product, two years apart, and the 55% had compressed to single digits.
Both numbers are real. They measure different things. The 95% confidence interval on the 55% figure runs from 21% all the way to 89%, a fact almost nobody quotes alongside it. Cursor's November 2025 paper added a third data point: 39% more PRs under agent mode, on legitimate quasi-experimental methodology. The same paper notes that the revert rate didn't change and the bugfix rate slightly decreased. The throughput is real. The quality story isn't there.
So which one do you plan capacity on? That is the entire question. And the honest answer sits between them — closer to the lower end than any vendor wants to admit.
Gross Is Real. Net Is What Ships.
There is a frame for this that every CFO already knows. Revenue is a gross number. Nobody runs a business on revenue. You strip out cost of goods, operating expense, tax — and what survives is net income. A board that confused the two would not last a quarter.
AI productivity claims work the same way. PRs merged, tasks completed, tokens processed, lines generated — these are gross numbers. They are real. They are also before costs. The costs are downstream: incident remediation, security debt, senior-engineer review time, code that gets written and then deleted before it survives a fortnight in production. The vendor publishes gross. Nobody publishes net.
Gartner has been telling CFOs this for over a year. Randeep Rathindran, Distinguished Vice President, Research, in the Gartner Finance practice, was direct in March 2025:
"Despite the excitement surrounding AI, its impact on productivity has been inconsistent, leading to what some describe as the AI productivity paradox… CFOs should recalibrate expectations on how AI will truly impact worker productivity and headcount."
Recalibrate by how much? If gross is everywhere and net is nowhere, the obvious question is the size of the deductions. The largest available telemetry dataset has a brutal answer.
The Subtractions Faros Measured
In April 2026, Faros AI published the Acceleration Whiplash report — two years of longitudinal SDLC telemetry from 22,000 developers across 4,000+ enterprise teams. Not a survey. Raw data from version control, CI/CD pipelines, and incident management systems. They compared each organisation's lowest-AI-adoption period against its highest. The findings have a particular shape.
The gross side first:
| Metric | Change |
|---|---|
| Epics completed per developer | +66.2% |
| Task throughput per developer | +33.7% |
| PR merge rate per developer | +16.2% |
Genuine productivity. Now the subtractions:
| Metric | Change |
|---|---|
| Incidents per PR | +242.7% |
| Median time in code review | +441.5% |
| Code churn ratio | +861% |
| PRs merged with zero review | +31.3% |
| Bugs per developer | +54% (up from +9% in 2025) |
And the line that should stop the meeting:
Deployment frequency is down 11.7% even as PR merge rate is up 16.2%.
More code is being written. Less is shipping safely. The constraint has moved from production to qualification — review, testing, staging, the unglamorous work that makes code real. One precision note: 242.7% is incidents per PR, not per deployment. The ratio matters because AI inflates the denominator. Either way it is unwelcome.
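To see why the denominator matters, the figures above can be composed into a rough per-developer and per-deployment view. This is a back-of-envelope sketch, not something Faros reports. It assumes the three percentage changes describe the same teams over the same period with constant headcount, so the ratios can simply be multiplied and divided. The function and variable names are illustrative.

```python
# Back-of-envelope composition of the Faros deltas quoted above.
# Assumption (mine, not Faros's): the changes apply to the same teams over
# the same period with constant headcount, so ratios compose directly.

def ratio(pct_change: float) -> float:
    """Convert a percentage change (e.g. +16.2) into a multiplier (1.162)."""
    return 1.0 + pct_change / 100.0

pr_merge_ratio = ratio(16.2)            # PRs merged per developer
incidents_per_pr_ratio = ratio(242.7)   # incidents per PR
deploy_ratio = ratio(-11.7)             # deployment frequency

# Incidents per developer = (incidents per PR) x (PRs per developer).
incidents_per_dev_ratio = incidents_per_pr_ratio * pr_merge_ratio

# Incidents per deployment = incidents per developer / deployments.
incidents_per_deploy_ratio = incidents_per_dev_ratio / deploy_ratio

print(f"incidents per developer:  {incidents_per_dev_ratio:.2f}x")
print(f"incidents per deployment: {incidents_per_deploy_ratio:.2f}x")
```

Under those assumptions, incidents per developer land at roughly 4× and incidents per deployment at roughly 4.5×. Because PR volume went up, the per-PR figure understates the absolute incident load rather than overstating it.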
Faros names the asterisk themselves:
"Throughput measures what was shipped, not what survived. The 861% is the asterisk on every output number in this report."
These are operational numbers. They become financial numbers when you put them through a planning model. Two months ago, DORA published one. Faros stress-tested it.
When You Plug Real Numbers Into DORA's Calculator
In April 2026, DORA released an interactive ROI of AI calculator alongside their report. Default scenario for a 500-person organisation: +$3.28M first-year benefit, +39.2% ROI, 0.7-year payback. It is the kind of number that closes a board slide cleanly.
DORA themselves call the calculator a "conversation starter." Faros made the conversation honest. They plugged in their telemetry — deployments down, features up, change-failure rate climbing from 5% to 15%, J-curve duration extended from DORA's optimistic 3 months to a telemetry-realistic 12. The combined scenario:
| Scenario | First-year benefit | ROI | Payback |
|---|---|---|---|
| DORA default | +$3.28M | +39.2% | 0.7 yr |
| J-curve realism only (12 months) | −$6.62M | −36.2% | 1.6 yr |
| Quality realism only (CFR 5→15%) | +$1.27M | +15.1% | 0.9 yr |
| Telemetry-informed combined | −$3.46M | −18.9% | 1.2 yr |
The most load-bearing variable is not throughput gain. It is J-curve duration. Changing only the J-curve from 3 months to 12 months swings the outcome by $9.9M. Nothing else in the model has that lever. Faros's framing on what this means for a CFO:
"A 1.2-year payback is something a CFO can plan around. A 0.7-year payback that turns into 1.6 because the J-curve input was set too short is something that erodes trust in every subsequent forecast. The slippage isn't in the math. It's in the inputs the user accepted without testing."
DORA's 2026 ROI report makes the same point in different language. Initiatives don't fail because the technology is flawed. They fail because leadership pulls funding during the natural learning-phase dip — exactly the moment the J-curve predicts and the default inputs hide. The calculator works. The defaults were optimistic. If you are a CTO presenting AI ROI to your CFO this quarter, the most dangerous assumption on the slide isn't throughput. It's the J-curve duration.
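The sensitivity claim is easy to check from the table. A minimal sketch, using only the first-year benefit figures quoted above; the dictionary layout and the `swing_vs_default` helper are my own illustration, not part of DORA's calculator.

```python
# First-year benefit per scenario, in $M, copied from the table above.
scenarios = {
    "DORA default": 3.28,
    "J-curve realism only (12 months)": -6.62,
    "Quality realism only (CFR 5->15%)": 1.27,
    "Telemetry-informed combined": -3.46,
}

baseline = scenarios["DORA default"]

def swing_vs_default(name: str) -> float:
    """How far a scenario's first-year benefit falls below the default, in $M."""
    return round(baseline - scenarios[name], 2)

for name in scenarios:
    print(f"{name:36s} swing: ${swing_vs_default(name):.2f}M")
```

Changing only the J-curve input moves the result by $9.90M. Changing only the quality input moves it by $2.01M. That is the sense in which J-curve duration is the load-bearing variable.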
Plan at 1.3×. Hold a 20% Reserve. Don't Reach for More Agents.
So if 1.2–1.5× is the honest number and the J-curve is the load-bearing variable, what does that look like in a sprint plan? Five rules I've been using with the teams I work with:
- Plan capacity at 1.2–1.5×, not 2× or 5×. Use the lower end (1.2×) for the first 4–6 sprints. Raise to 1.3–1.4× only when change-failure rate and deployment frequency are flat or improving for three consecutive sprints.
- Hold 15–25% of sprint capacity as a remediation reserve. This is the slot for the 242.7% incident inflation. If you don't book it, your seniors burn weekends paying it.
- Set a PR-noise budget: ≤2× pre-adoption defect rate per deployment. Exceeded for four consecutive weeks? Pause the rollout, debug the harness, do not add more agents.
- Measure delivery, not throughput. Incident-free deployments per week and CFR are the planning unit. Not PR count. Not velocity points. Not vibes from standup.
- When delivery slips, do not reach for more capacity. This is the anti-pattern Faros's data makes visible. The bottleneck under high AI adoption is human review, not code generation. Adding more agent capacity widens the queue without clearing it. Simon Willison's "shifted bottleneck" framing, which I've used in this series before, is the diagnosis. More agents do not clear the gate. They flood it.
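The first three rules are mechanical enough to write down as checks. What follows is a sketch under my own assumptions, not a tool anyone ships. The names (`SprintMetrics`, `plan_capacity`, `can_raise_multiplier`, `should_pause_rollout`) are hypothetical. I read "flat or improving for three consecutive sprints" as each of the last three sprints being no worse than the one before it, which needs four sprints of history.

```python
from dataclasses import dataclass

@dataclass
class SprintMetrics:
    change_failure_rate: float  # fraction of deployments that caused an incident
    deploys_per_week: float

def plan_capacity(baseline_points: float, multiplier: float = 1.2,
                  reserve_fraction: float = 0.20) -> dict:
    """Split AI-adjusted capacity into feature work and a remediation reserve."""
    if not 1.2 <= multiplier <= 1.5:
        raise ValueError("multiplier should stay within 1.2-1.5x")
    if not 0.15 <= reserve_fraction <= 0.25:
        raise ValueError("reserve_fraction should stay within 15-25%")
    total = baseline_points * multiplier
    reserve = total * reserve_fraction
    return {"total": total, "reserve": reserve, "feature": total - reserve}

def can_raise_multiplier(history: list[SprintMetrics]) -> bool:
    """True when CFR and deploy frequency held or improved over the last three sprints."""
    if len(history) < 4:
        return False  # not enough history to compare three consecutive sprints
    recent = history[-4:]
    for prev, cur in zip(recent, recent[1:]):
        if cur.change_failure_rate > prev.change_failure_rate:
            return False
        if cur.deploys_per_week < prev.deploys_per_week:
            return False
    return True

def should_pause_rollout(weekly_defect_rates: list[float],
                         pre_adoption_rate: float) -> bool:
    """True when defects per deployment exceeded 2x baseline in each of the last four weeks."""
    budget = 2.0 * pre_adoption_rate
    last_four = weekly_defect_rates[-4:]
    return len(last_four) == 4 and all(rate > budget for rate in last_four)

# Example: a team with 100 points of pre-adoption capacity, at the defaults.
plan = plan_capacity(baseline_points=100)
print(plan)
```

The value is in having the thresholds written down before the sprint starts, so that raising the multiplier or pausing the rollout follows a stated rule and not a mood.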
Faros's recommendation to leaders is the bridge to the next post in this series:
"Do not rush to change headcount on the basis of first-year throughput numbers. The engineers absorbing the quality gap AI is creating are the ones you'll need most when the gap becomes visible."
Which raises an uncomfortable question. If the honest number is 1.3× and the loudest number is 10×, who carries the gap between them?
Running Before We Walk
The gap between 55% and 8.69% isn't a marketing distortion. It is a commitment trap. Teams are quoting gross and shipping net, and the difference is being absorbed by the engineers nobody is photographing for the case study.
The cognitive engine for the inflation is well-documented. METR's 2025 RCT — which I unpacked in the J-Curve post — found that "developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%." Telemetry said they got slower. Self-report fuels the inflation. Every vendor satisfaction survey is downstream of that gap.
The structural ceiling is just as visible. Anthropic's own 2026 Agentic Coding Trends Report puts engineers' AI usage at 60% of their work, with full delegation at 0–20%. The remaining 40 to 60 percentage points of that usage are supervision, and supervision does not compress. The 10× number assumes those points compress to zero. They do not.
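The ceiling can be made concrete with an Amdahl-style calculation. This is my simplification, not a model from the Anthropic report. It assumes fully delegated work costs zero human time, supervised work speeds up by a factor you choose, and everything else is unchanged. The function name and parameters are illustrative.

```python
def ceiling_speedup(delegated_fraction: float,
                    supervised_fraction: float = 0.0,
                    supervised_speedup: float = 1.0) -> float:
    """Amdahl-style upper bound on overall speedup.

    delegated_fraction: share of work handed off entirely (assumed to cost zero time).
    supervised_fraction: share of work done with AI but still supervised.
    supervised_speedup: how much faster the supervised share becomes.
    """
    untouched = 1.0 - delegated_fraction - supervised_fraction
    if untouched < 0 or supervised_speedup <= 0:
        raise ValueError("fractions must sum to at most 1; speedup must be positive")
    remaining_time = untouched + supervised_fraction / supervised_speedup
    if remaining_time == 0:
        return float("inf")
    return 1.0 / remaining_time

# Best case for full delegation alone: 20% of work vanishes, nothing else changes.
print(ceiling_speedup(delegated_fraction=0.2))

# Generous case: all 60% of AI-touched work costs zero time.
print(ceiling_speedup(delegated_fraction=0.6))
```

Under these assumptions, full delegation at 20% caps out at 1.25×. Even if every bit of the 60% AI-touched work took no time at all, the untouched 40% holds the ceiling at 2.5×. Reaching 10× would require 90% of the work to disappear.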
The honest move is to bank the gains as buffer, not feature scope. Use the uplift to reduce on-call burden, pay down debt, train juniors, harden CI/CD. Anything else commits the buffer twice — once to the headline, once to the bug queue four sprints later when the J-curve is at its worst and the board's patience is at its thinnest. This is the running-before-we-walk problem. Vendor claims sprint. Telemetry walks. Capacity plans commit at sprint pace and discover walking pace four sprints later, when trust is the thing that gets spent.
Honest Numbers, Real Decisions
Back to 55% versus 8.69%. The gap between them isn't a marketing problem. It is the gap between the demo and the deployment, between the slide and the sprint, between what the model can do and what the system around the model can accept. Both numbers are real. Only one survives contact with production.
The 1.2–1.5× number is unglamorous. It is also defensible, plannable, and survivable. The 10× number is exciting and ungovernable. Pick the one you can commit to without rebuilding your board's trust every quarter.
The press release is gross. Nobody is publishing net. Which of the two you plan against is the choice your next capacity plan is going to make, whether you name it or not.
The gap has to be paid in someone's hours. The next post is about whose.
Series Navigation
- Post 1: The Governance Wall — Why Most AI Agents Can't Reach Production
- Post 2: The 5-Step Loop — Why Your Agent Fails at Step 4
- Post 3: The Productivity J-Curve — Why Week 6 of an AI Pilot Always Hurts
- Post 4: 1.2× Not 10× — The Honest Productivity Number Nobody's Publishing (you are here)
- Post 5: Protect the Juniors (coming soon)