Vibe Coding vs Agentic Engineering: Where the Prototype Stops and Production Starts
Andrej Karpathy — the man who coined "vibe coding" — got bitten by his own creation. In his April 2026 Sequoia Ascent talk, he described a payments bug in his own side project, MenuGen. The agent had decided to match Stripe purchases to user accounts by email address. Stripe email and Google email can differ. Real users could pay and never receive what they bought.
"Why would you use email addresses to cross-correlate funds? You need a persistent user ID. This is the kind of mistake agents still make."
A prototype workflow ran into a production-grade problem. The agent had no way to see that "match by email" is fine for a demo and indefensible for money. The bug isn't that the agent was wrong. It's that the prototype operating mode followed the code across a boundary it shouldn't have crossed.
This is what most "AI failed in production" stories really are.
Two names, two modes
Vibe coding was coined by Karpathy in a tweet on 11 February 2025:
"There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. […] I 'Accept All' always, I don't read the diffs anymore. […] It's not too bad for throwaway weekend projects."
The qualifier — "throwaway weekend projects" — was in the coinage. The boundary was named on day one.
Agentic engineering is Simon Willison's term for what the other end looks like. He'd been circling the vocabulary for a year — first as "vibe engineering" in October 2025, then committing to "agentic engineering" in February 2026:
"I'm using Agentic Engineering to refer to building software using coding agents — tools like Claude Code and OpenAI Codex, where the defining feature is that they can both generate and execute code — allowing them to test that code and iterate on it independently of turn-by-turn guidance from their human supervisor."
The contrast he draws is sharper than the tools suggest:
"I think of vibe coding using its original definition of coding where you pay no attention to the code at all […]. Agentic Engineering represents the other end of the scale: professional software engineers using coding agents to improve and accelerate their work by amplifying their existing expertise."
These are not points on a quality spectrum. They are different operating modes. Vibe coding is fast, loose, prompt-driven, no diff review. Agentic engineering is structured intake, verified output, human gates, telemetry. The same tools serve both. The discipline is what changes.
What actually changes at the boundary
Rendering diagram...
Five things flip when a prototype crosses the boundary into something other people depend on:
-
Intake. A prompt becomes a spec. Acceptance criteria are written before the agent runs, not extracted from the diff after. Anthropic's 2026 Agentic Coding Trends Report calls this "intent as infrastructure" — specs as durable, executable artefacts.
-
Verification. "Try it and see" becomes a Stop hook running lint, typecheck, and tests on every turn. Lightrun's 2026 industry survey found 43% of AI-generated code still required manual debugging in production after passing QA and staging. The test pyramid is no longer sufficient evidence.
-
Review gates. "Accept All, don't read the diffs" becomes a human reading every system-level decision, every security boundary, every data flow. Karpathy himself put the line at: "The agentic engineer does not blindly accept generated code. They design specs, supervise plans, inspect diffs, write tests, create evaluation loops, manage permissions, isolate worktrees, and preserve quality."
-
Telemetry. "Watch the console" becomes OTEL spans per session, per turn, per tool call. Beam AI's production framing is blunt: "Every action an AI agent takes in a production environment should be traceable." Agent observability is now a first-class operational concern.
-
Harness profile. Permissive in
experiments/. Strict insrc/.permissions.denyfor.env*andsecrets/**. Hooks that deterministically block dangerous shell. This is the harness layer of the five-layer model doing its actual job — not as decoration, as the load-bearing wall between modes.
Each of these costs roughly nothing in a weekend project. Each of them is the difference between shipping software and shipping liabilities to a payments table.
The anti-pattern has a name
The corpus calls it the vibe-coding leak: weekend experiment built with "accept all diffs, don't read", ships to prod, six weeks later the security defects emerge. The cause is always the same — no promotion gate.
Columbia DAPLab's January 2026 study catalogued nine failure categories from state-of-the-art coding agents on non-toy tasks. The most dangerous wasn't crashed code:
"The code appears to run without errors, but the app doesn't actually do what the user asked."
The other line worth tattooing somewhere is from the same paper:
"Currently, vibe coding agents see your requirements as preferences, not as enforced policies."
Or Kognitos's practitioner version of the same finding, one sentence shorter:
"The same qualities that make AI-generated code fast to produce make it dangerous to operate."
The cruelest twist in this story is that Willison — the defender of the production discipline — caught himself sliding into the anti-pattern. In a May 2026 post, he admitted he had quietly stopped reviewing every line of AI-generated code, even for his production work. He named the mechanism:
"There's an element of the normalization of deviance here — every time a model turns out to have written the right code without me monitoring it closely there's a risk that I'll trust it at the wrong moment in the future and get burned."
If the coiner of "agentic engineering" can drift back into vibe coding without noticing, this is not a discipline problem you solve once. It is a discipline you re-enforce structurally — at the harness layer, at the gate layer, at the review layer — because individual willpower won't hold the line.
The move
The mistake most teams make is thinking the transition is about quality — "we just need to make our AI code better." It isn't. Quality follows the operating mode. The transition is a workflow rebuild at a specific boundary.
The five-stage maturity model frames the move the same way: Stage 1 looks a lot like vibe coding — one engineer, one head, no shared discipline. Stage 3+ — hooks enforcing the rules, subagents specialised by task, harness profile per directory — is agentic engineering operating at team scale. Crossing from one to the other isn't a campaign. It's a mode switch with named consequences.
This piece is the workflow-level sibling to the prompt-to-context-engineering shift: both are discipline-renaming moments. Prompt → context engineering names the skill that changes. Vibe coding → agentic engineering names the workflow that changes. You can be brilliant at context engineering and still leak vibe-coded code into production if the workflow boundary isn't there.
Keep vibe coding. It is genuinely useful for prototypes, experiments, and the throwaway projects Karpathy was talking about. Just don't promote the code without rebuilding the workflow underneath it.
Willison's golden rule, from the original 2025 disambiguation post, is still the cleanest test:
"I won't commit any code to my repository if I couldn't explain exactly what it does to somebody else."
If that sentence is true of your production diffs, you are doing agentic engineering. If it isn't, you are vibe coding — and you should know which side of the gate you are on.
The Cutler.sg Newsletter
Weekly notes on AI, engineering leadership, and building in Singapore. No fluff.
The 5-Step Loop: Why Your Agent Fails at Step 4
ReAct gave us a three-step loop. Production hardened it into five. The two new steps — Plan and Verify — are where everything that goes wrong, goes wrong. And the field has now named the worst offender.
Standardise the Harness, Customise the Work: The 5-Layer Agent Architecture
Three open-source extractions converged on the same five layers. The architecture isn't a vendor narrative — it's a discovered structure. Here's the decision rule that keeps you from over-engineering it.
The 30 Principles for Agentic Engineering — Part 5: Calibration and Reality
Principles 26–30. The calibration layer that catches what the rest of the framework would miss: a PR-noise budget, independent verification, model-swap regression discipline, the 15-tool-call rule, and protecting junior development.