Stop Building AI for AI's Sake — How VC Mindset Transforms Product Evaluation
- The demo that works and solves nothing
- The bank doesn't care about your architecture
- What the smart money actually evaluates
- Two companies that won by being boring
- When the technology leads, here's how it dies
- The questions I ask before I believe a demo
- Why finance got here first
- Adopting the lens on purpose
The demo that works and solves nothing
I sit on both sides of the table. Most weeks I'm building — orchestrating four or five Claude Code agents across my own projects, shipping agent frameworks, watching what actually survives contact with production. And in a venture-builder context, I'm also one of the people a founder walks through their AI product hoping for a cheque. So I've developed a reflex. Thirty seconds into a demo, while the model is doing something genuinely clever on screen, I've stopped watching the model. I'm waiting for the founder to tell me which human's afternoon just got shorter, and by how much. Most of them never get there. The demo works. It solves nothing anyone was paying to have solved.
That gap has a price tag. Repairing a failed AI implementation runs about €710,000 ($750,000) — roughly double the original budget, before you count the opportunity cost of the year you spent building the wrong thing. And the wrong thing is the norm: AI projects fail at meaningfully higher rates than ordinary IT work, and when you read the post-mortems the cause is almost never the model. It's that nobody upstream asked whether the model was answering a question the business actually had.
The reference case is IBM Watson for Oncology. More than $62 million went into a system that ended up issuing unsafe treatment recommendations and got shut down, and IBM later folded the whole Watson Health division after pouring in billions. The technology was a marvel. It was also a marvel pointed at no clinician's real problem. I've watched the small-startup version of Watson more times than I can count: a team in love with what they built, unable to name the customer whose day it changes.
So here is the lens I actually use, the one that lets you tell a Watson from a winner before the money moves. It is not a technology lens. It is the question every good investor asks and most engineers forget to ask: who needed this, and how will we know it worked?
The bank doesn't care about your architecture
Amir Elkabir, who wrote "Lead with AI," put it in a single sentence that I've quoted in pitch meetings since:
"Banks don't care about model architecture—they care about lowering audit workload"
He's right, and the proof is in what banks measure. When JPMorgan Chase put AI on commercial loan agreement review, the headline number wasn't a benchmark or a parameter count. It was 360,000 hours saved annually — hours, not accuracy points. That's a procurement officer's language, not a researcher's.
Look across the sector and the pattern holds. The whole culture orients relentlessly toward compliance, cost reduction, and measurable return. AI-driven compliance monitoring cuts review time by up to 90%, taking the review of 1,000 recorded conversations from 500 hours down to 50 a month. BNY Mellon's robotic process automation hit 100% accuracy in account closures, an 88% improvement in processing time, and $300,000 in annual savings. Every one of those figures is denominated in money or time. None is denominated in cleverness. The buyers who get the most out of AI talk about it the way an investor talks about a company — and it's worth asking why the people building it so often don't.
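The review-time figure is plain arithmetic, and it is worth being able to reproduce a number like that before repeating it in a pitch. A minimal sketch in Python, using only the figures quoted above; the function name `percent_reduction` and the per-conversation breakdown are illustrative additions, not something the source reports:

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# Figures quoted in the text: 1,000 recorded conversations per month,
# reviewed in 500 hours before and 50 hours after.
conversations = 1_000
hours_before = 500
hours_after = 50

reduction = percent_reduction(hours_before, hours_after)

# Derived here for illustration: average review minutes per conversation.
minutes_each_before = hours_before * 60 / conversations
minutes_each_after = hours_after * 60 / conversations

print(f"Review time cut by {reduction:.0f}%")
print(f"Per conversation: {minutes_each_before:.0f} min -> {minutes_each_after:.0f} min")
```

Restating the saving per conversation is the same number in a form a compliance team can sanity-check against its own workload.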
What the smart money actually evaluates
Venture capitalists got their inoculation against technology hype the hard way, by surviving the dot-com bubble. The frameworks they built afterward are multi-dimensional and ROI-first: data strategy, market fit, team, a credible path to return — algorithmic sophistication doesn't even make the front page anymore. The thesis has quietly inverted. The question is no longer "is the AI impressive?" It's "does the AI move a number someone is willing to pay for?"
In practice that resolves to three things I look for, in this order:
- A data moat, not an algorithm. VCs prize startups with access to unique, high-quality, scalable datasets — because the model is a commodity in eighteen months and the data isn't.
- Unit economics that survive a spreadsheet. Revenue model, customer acquisition cost, contribution margin sit at the centre of the decision, not the architecture diagram.
- Evidence a stranger can check. Independent benchmarks, customer case studies, signed contracts, real usage metrics — claims I can verify without taking the founder's word.
BMW i Ventures says the quiet part at full volume: "Applied AI companies are where the party's at. I don't care about your next-gen neural net. All that matters is if the darned thing works." Jenny Fielding at Everywhere Ventures sharpens it: "AI is an enabling technology—it's not the business itself."
I'll be honest about where my own bias sits, because it cuts both ways. I love the engineering. Running a swarm of agents across a codebase is the most fun I've had building software in twenty years. Which is exactly why I don't trust myself in a demo, and why the investor reflex matters — it's the discipline that stops me funding the version of a product I'd enjoy building over the version a customer would pay for.
Two companies that won by being boring
UiPath didn't win VC backing with a flashy model. It won by handing enterprises immediate efficiency gains and cost reductions they could put on a spreadsheet inside a quarter. Databricks did the same with a unified analytics platform powering mission-critical infrastructure — not a better algorithm, a better outcome from the data a company already had. On the factory floor the same discipline shows up as 280% ROI over 18 months, with predictive maintenance cutting expenses 30–40% against reactive models.
The thread is almost embarrassingly simple. They started from a business problem and reached for technology only when the problem demanded it. The cool part of the technology was incidental.
When the technology leads, here's how it dies
I want to be precise about the failure mode, because "AI project failed" is too vague to learn from. Every spectacular collapse below has the same shape: a powerful capability deployed in front of a business requirement nobody mapped first.
IBM Watson for Oncology — the $62 million number again — wasn't beaten by a better model. It was trained on hypothetical patient data and built as a showcase, so it gave unsafe treatment recommendations. The sophistication was real; the clinical requirement was an afterthought.
Amazon's AI recruiting tool systematically downgraded resumes containing the word "women's," because it learned from a decade of biased hiring history. The model did exactly what it was trained to do. Fairness and legal exposure were the business requirements, and they were never in the spec.
McDonald's drive-thru AI got shut down after years and millions because it misheard orders and frustrated customers. Excellent speech recognition met the actual operating conditions of a drive-thru and lost.
Air Canada's chatbot invented a bereavement-fare refund policy, a customer relied on it, and when the refund was denied a tribunal held the airline liable for what its bot said. A capable language model with no connection to the company's real rules didn't save money — it generated a legal liability.
Read those four together and you notice the failure was always upstream of the engineering. Which is the whole argument for moving the evaluation upstream too — and here is where the bottleneck I keep running into across my own work resurfaces. The hard question is no longer "can the AI do it?" In every one of these cases, the AI did the thing. The hard question is "can a human verify it's the right thing before the business acts on it?" Watson could recommend; nobody could safely trust the recommendation. The chatbot could answer; nobody had checked the answer against policy. The capability outran the verification, every time.
The questions I ask before I believe a demo
Peak Capital uses 31 questions to separate real AI ventures from theatre. You don't need all 31. You need the handful that a founder in love with their model can't answer without flinching. These are the ones I lean on — and they work just as well pointed inward, at your own roadmap, as they do across a pitch table.
Start where the AI isn't.
- Describe the customer's problem without using the word "AI." If you can't, you don't have a problem, you have a technology looking for one.
- Is that problem currently costing measurable time or money? Name the figure.
- Would a boring, non-AI solution do nearly as well? (If yes, build that.)
Make the win measurable.
- Which specific KPI does this move — revenue, cost, hours, error rate?
- How is success counted, in dollars and time, not in vibes?
- Which business stakeholder picked that metric, and will they sign their name to it?
Demand evidence a stranger can check.
- Pilot results, case studies, an independent benchmark — anything I don't have to take on faith.
- A demonstrable improvement over the process that exists today.
Pressure-test the moat and the failure modes.
- What proprietary data or integration makes this hard to copy next year?
- What breaks it, who gets hurt when it breaks, and how is that contained?
- What does it cost to own — not to buy, to own — including the maintenance nobody budgets for?
A founder who breezes through those is rare and worth backing. A founder who reaches for model performance every time you ask a business question is telling you, gently, that there's no business underneath the model.
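One way to keep yourself honest with the questions above is to write them down as data and record which ones got a concrete answer. The sketch below is a hypothetical illustration, not Peak Capital's framework or any standard tool: the `Question` type, the group labels, and the `gaps` helper are all assumptions, and the prompts are paraphrased from the checklist so that "yes" is always the good answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    group: str   # which of the four checklist headings it belongs to
    prompt: str  # phrased so that a concrete "yes" is the good answer


# Paraphrased from the checklist above. The "boring solution" question is
# inverted so that True always means "answered well".
CHECKLIST = [
    Question("problem", "Problem stated without the word 'AI'"),
    Question("problem", "Current cost in time or money has a named figure"),
    Question("problem", "A boring, non-AI solution has been ruled out"),
    Question("metric", "A specific KPI is named"),
    Question("metric", "Success is counted in dollars and time"),
    Question("metric", "A business stakeholder has signed off on the metric"),
    Question("evidence", "Pilot results, case studies, or a benchmark exist"),
    Question("evidence", "Improvement over today's process is demonstrable"),
    Question("moat_and_risk", "Proprietary data or integration is hard to copy"),
    Question("moat_and_risk", "Failure modes and containment are described"),
    Question("moat_and_risk", "Cost of ownership includes maintenance"),
]


def gaps(answers: dict[str, bool]) -> dict[str, list[str]]:
    """Return the prompts without a concrete 'yes', grouped by heading.

    A prompt missing from `answers` counts as a gap, not as a pass.
    """
    missing: dict[str, list[str]] = {}
    for question in CHECKLIST:
        if not answers.get(question.prompt, False):
            missing.setdefault(question.group, []).append(question.prompt)
    return missing


# Hypothetical pitch: solid everywhere except verifiable evidence.
answers = {q.prompt: True for q in CHECKLIST if q.group != "evidence"}
for group, prompts in gaps(answers).items():
    print(f"{group}: {len(prompts)} unanswered")
    for prompt in prompts:
        print(f"  - {prompt}")
```

The sketch deliberately has no score or weighting. The point of the questions is to surface which group a founder keeps avoiding, and a single number would hide that.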
Why finance got here first
Banks didn't develop this discipline because bankers are wiser. They developed it because regulation gave them no choice, and that accident of constraint turned them into the best AI buyers in the economy. The EU AI Act's strict requirements for "high-risk" applications — transparency, fairness, bias mitigation, supervisory audit — force a bank to evaluate a tool on what it does to compliance, not on how clever it is. Before adoption they demand compliance documentation, integration roadmaps, and third-party risk assessments. The paperwork outranks the architecture by mandate.
And the results read like the success column of every other section. Wells Fargo's assistant handled over 20 million interactions; HSBC's AI anti-money-laundering work cut false positives sharply. As S&P Global puts it, "AI strategies have the potential to provide competitive advantages to banks that have the capacity and flexibility to make best use of them" — the operative phrase being best use, which is a business judgment, not a technical one. The rest of us get to choose this discipline voluntarily, which is harder, because nobody's forcing our hand.
Adopting the lens on purpose
If I had to compress the whole thing into a working habit, it's this: refuse to let the technology choose the problem.
Elkabir again, because he says it cleaner than I can: "For AI products to deliver value, organizations must align AI initiatives directly with business objectives and operational needs. Unless AI is implemented to solve tangible business problems, it is all just noise." Darren Ott at Dolby reaches the same place from the engineering side: "Don't look at AI for AI's sake. Look at the problem that you want to solve, and then bring the technology in to fix it rather than saying, 'Oh, let's have AI.'"
The practice that makes this real is small, fast, and humbling. VCs expect founders to build an MVP and test the assumption quickly rather than design a cathedral on day one, and it's the same instinct that has me throwing four agents at a narrow slice of a problem before I trust the approach at scale. McDonald's China is the textbook case: by aiming AI at one workflow rather than a showcase, it took monthly employee transactions from 2,000 to 30,000, a 1,400% jump from solving one real thing well.
Then make the metric non-negotiable before anyone writes code. Five Sigma's claims system shipped an 80% drop in errors, 25% more adjuster productivity, and a 10% shorter cycle; John Deere's vision system cut non-residual herbicide use by more than two-thirds. Those numbers existed as targets before they existed as results. That's the tell of a project built business-first: the success criteria predate the technology.
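"Targets before results" can be made mechanical: write the success criteria down as data first, then check the pilot against them. The sketch below is one hypothetical way to do that. The `Target` type and metric names are assumptions, and the target values are invented placeholders, since the source does not say what targets Five Sigma actually set. Only the observed figures come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    metric: str
    goal: float             # agreed with the business before any code exists
    higher_is_better: bool

    def met_by(self, observed: float) -> bool:
        """True when the observed value reaches or beats the goal."""
        if self.higher_is_better:
            return observed >= self.goal
        return observed <= self.goal


# Step 1: targets, written down first. These values are invented
# placeholders for illustration, not Five Sigma's real targets.
targets = [
    Target("error_rate_change_pct", goal=-50.0, higher_is_better=False),
    Target("adjuster_productivity_change_pct", goal=20.0, higher_is_better=True),
    Target("cycle_time_change_pct", goal=-5.0, higher_is_better=False),
]

# Step 2: results, filled in after the pilot. These are the Five Sigma
# figures quoted in the text, expressed as signed percentage changes.
observed = {
    "error_rate_change_pct": -80.0,
    "adjuster_productivity_change_pct": 25.0,
    "cycle_time_change_pct": -10.0,
}

report = {t.metric: t.met_by(observed[t.metric]) for t in targets}
for metric, passed in report.items():
    print(f"{metric}: {'met' if passed else 'missed'}")
```

Because `Target` is frozen, the goals cannot be quietly edited once results arrive, which is the discipline the paragraph describes.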
And someone has to carry it. The teams that move AI from pilot to production almost always have an internal champion who does the unglamorous work of stakeholder alignment — a person fluent in business value, not just technical merit, evaluating each idea through compliance, risk, operations, and impact at once. Finance does this with cross-functional review by default. Everyone else has to build the muscle on purpose.
I came into AI as a builder — twenty years of it, from server racks in London Docklands to whatever I'm shipping from Singapore this week — and the instinct of a builder is to be moved by what's possible. The investor reflex is the corrective: to be moved instead by what's needed. The two have to live in the same head, because the most dangerous AI project isn't the one that fails to work. It's the one that works beautifully, demos brilliantly, costs three quarters of a million to unwind, and turns out to have answered a question nobody was asking. The model was never the hard part. Deciding it was worth building — that's the part only a person, thinking like an investor, can get right.