OpenAI's AgentKit: Late to the Agent Party or Strategic Masterstroke?
$ grep -n "^##" 2025-10-openai-agentkit-late-to-the-party-or-strategic-masterstroke.md>
The Setup
October 6, 2025. Sam Altman walks onto the stage and announces OpenAI's "something new" — AgentKit and Agent Builder.
The eyebrows went up across half my timeline. Agents? Seriously? We've been gluing those together with LangChain for two years.
I had a different reaction, because I've been on the other side of this. I've built the thing AgentKit is competing with. Gluon, my open-source multi-agent orchestrator. Dagentic, a serverless agent framework. I run four or five Claude Code agents across separate projects most working days, which means I keep having to make the unglamorous decision nobody puts in a keynote: which orchestration layer do I actually reach for this time, and what does it cost me when I'm wrong.
So I didn't watch the keynote wondering whether OpenAI was late. I watched it wondering which problem they'd decided to solve — because the agent problem isn't one problem, and the part everyone demos is the part I worry about least.
The money says the stakes are real. The agentic AI market is projected to go from $7.06 billion in 2025 to $93.20 billion by 2032 — a 44.6% compound annual growth rate. And enterprises aren't waiting: 33% of enterprise software will incorporate agentic AI by 2028, up from less than 1% today. A market that size, growing that fast, with no settled standard, is exactly the kind of thing OpenAI has eaten before.
What AgentKit actually ships
Strip the keynote gloss off and AgentKit is a bet that the agent ecosystem's real pain isn't capability — it's assembly.
That bet is correct, and I can tell you why from the receipts. When I built Gluon, almost none of the hard work was the agent reasoning. The model could already reason. The work was everything around it: wiring tools, threading state between steps, building the evaluation harness so I could tell whether a change made things better or just different, standing up the deployment path. A working agent today means LangChain for the logic, something like Zapier or n8n for integrations, a homegrown eval rig, and a deployment pipeline you maintain yourself. It functions. It's also four things that can rot independently, and I've spent more weekends than I'd like keeping that stack from drifting apart.
AgentKit collapses that into one surface. The Agent Builder interface looks more like Canva than an IDE — visual workflow creation a non-technical team can actually touch. ChatKit gives you embeddable chat UIs that drop into an existing app. And the part I'd have killed for: built-in trace grading and automated prompt optimization, so the evaluation harness I hand-rolled comes in the box.
The early numbers back the assembly thesis rather than the intelligence one. Bain & Company reported a 25% efficiency gain through dataset curation and prompt optimization. Ramp noted that Agent Builder "significantly reduced iteration cycles and deployment times." Read those carefully: iteration cycles, deployment times, curation. Nobody's reporting that the agents got smarter. They're reporting that the scaffolding got cheaper. That matches exactly what was expensive when I did it by hand.
It supports third-party models and pairs the visual builder with a code-first path, which is the right call — it lets the same product court the business user dragging boxes and the engineer who wants to drop to code when the boxes run out. That's not OpenAI fighting LangChain for the existing pie. It's OpenAI trying to grow the pie past the people who can already wire this stuff themselves.
The fragmented kingdom
To see why "assembly" is the right thing to attack, look at what the last two years actually felt like for anyone shipping agents.
LangChain dominates with roughly 450,000 developers worldwide and 43% of agentic-workflow implementations in production. CrewAI has 19% adoption among advanced builders, strong in Europe and Asia. AutoGen leads on multi-agent orchestration for research work. Three serious frameworks, three different mental models, zero agreement on what an "agent" even is.
I've lived inside that fragmentation. Choosing between these isn't picking a favourite — it's a bet with a switching cost attached, and you don't find out whether you bet right until you're deep enough that backing out hurts. Every team rebuilds the same evaluation and deployment plumbing because nobody shares it. The "tool sprawl" that ate enterprise software a decade ago was quietly repeating itself in the agent layer, and the bill landed as time spent on infrastructure instead of the thing you were actually trying to build.
That's a market waiting to be consolidated. But consolidation needs a consolidator with three things at once: developer trust, enterprise relationships, and real platform-operating experience. Very short list of companies that have all three.
The "OpenAI-compatible" move, again
Here's the part of OpenAI's history I take seriously, because I've felt it as a developer rather than read about it as an analyst.
OpenAI didn't invent the completions API. They made theirs the one everyone copies. "OpenAI-compatible" is now a phrase you'll find in the docs of platforms that compete directly with OpenAI — they standardised the interface so thoroughly that their rivals advertise compliance with it. That didn't happen because the API was technically untouchable. It happened because the developer experience was good enough, often enough, that building against it became the path of least resistance. I've reached for that endpoint myself when I had a dozen alternatives, purely because it was the one I didn't have to think about.
AgentKit is that move pointed at agents. Make the orchestration layer the obvious default, and the default quietly becomes the standard. The network pieces are already moving — OpenAI is positioning ChatGPT as a surface third-party apps run inside, and Booking.com, Expedia, Figma, Spotify, Khan Academy, Instacart, and Uber have already signed on. That's the same playbook that made the API a standard, now aimed one layer up the stack. Whether it lands is a separate question. The intent is unmistakable, and I've been on the receiving end of it working enough times not to dismiss it.
Why October 2025 isn't an accident
Three things had to be true at once for this to be good timing rather than late timing, and in 2025 they finally were.
Enterprises got past pilots. The pilot-to-production crossing for enterprise software runs 6-8 months. Companies that started agent pilots in early 2024 are now standing at exactly the platform-decision moment AgentKit is built to win.
The boring parts matured. Evaluation frameworks, governance tooling, security standards — the unglamorous scaffolding — moved from experimental to dependable. Trace grading, performance datasets, automated optimization: the very pieces I had to build myself for Gluon are now stable enough to ship as a product. You couldn't have shipped AgentKit credibly in 2023 because the ground it stands on didn't exist yet.
The market is expanding, not just growing. Going from $7 billion to $93 billion isn't a bigger version of the same thing. Industry observers call 2025 a "critical inflection point where agentic systems can finally move from pilot to production with confidence." That's agents crossing from a developer toy into business-critical infrastructure — which is a different, larger buyer.
The proof points are already in production. Klarna deployed AgentKit across a significant share of its support tickets. HubSpot uses it to power their Breeze assistant for sales automation. Different industries, real load, not a demo on a stage.
What it won't do for me
This is where my enthusiasm runs out, and it's the part the keynote skipped.
Everything AgentKit makes cheaper sits before the agent acts. Build the workflow faster, grade the traces, tune the prompts, deploy in a day instead of a fortnight. Genuinely valuable, and I'd take all of it. But it doesn't touch the bottleneck I keep slamming into across every agent project I've shipped: not "can the agent do the work?" — it can, that question's been answered — but "can a human confirm it before the action goes out the door?"
Run four or five agents at once, as I do, and that verification cost is the whole game. A prettier builder lets me create more agents faster, which means more output to check, which makes the bottleneck worse, not better. The trace grading helps after the fact. It doesn't help me decide, in the moment, whether to let this particular agent ship this particular thing unsupervised. No visual canvas solves that, because it isn't a tooling problem. It's a trust problem, and trust doesn't drag-and-drop.
So here's the honest split. Some of AgentKit's competition I'd still reach for, some of it I now wouldn't, and the reasons aren't the ones a feature comparison would predict.
I'd reach for AgentKit the moment I want a non-engineer on my team to stand up something useful without me — that's a capability LangChain has never genuinely offered, and it's most of why OpenAI is expanding the market rather than just raiding LangChain's. I'd reach for it for a standard support or sales-automation flow where the OpenAI-compatible gravity, the partner ecosystem, and the in-box eval harness mean I'm not maintaining four tools to ship one agent.
I wouldn't reach for it where I need the orchestration to bend in ways a vendor's abstraction won't allow — which is most of why Gluon exists. Frameworks like LangChain and CrewAI carry real switching costs, but they buy real control, and the day AgentKit's boxes run out is the day you discover whether you're a tenant or an owner. And I wouldn't lean on it for anything where the human-verification gap is the actual risk, because that's not in the box and OpenAI isn't pretending it is.
Simon Willison flagged the same fault line — the gap between OpenAI's expansive "a system that can do work independently" and the narrower, tool-shaped definition most working developers carry. He's right that it's ambiguous. I'd put it more bluntly: OpenAI is selling the system, and the system's hardest part is still the one nobody's selling.
The verdict
So, late or masterful? Having built the thing it competes with, my answer is: deliberately, narrowly masterful — and only on the half of the problem they chose to fight on.
The half they took is the one I'd have paid them for in my Gluon weekends. The assembly tax on agents is brutal and unglamorous, and collapsing four tools into one defensible platform, at the exact moment enterprises cross from pilot to production, is a clean strategic shot. If the OpenAI-compatible gravity does to orchestration what it did to completions, they consolidate a fragmented market right as it goes vertical. That's not a company scrambling to catch up. That's a company that waited until the boring parts were finally boring enough to productise — which, after twenty years of watching platforms win, is the version of "late" I've learned to bet on.
The half they left on the table is the one that actually keeps me up: an agent's value is now capped not by what it can do but by how fast a human can trust what it did. AgentKit makes the front of that pipeline cheaper and leaves the back of it exactly where it was. Which means the better it works, the more output it produces for someone to verify — and that someone is still me, still reading the trace, still deciding whether to let it ship.
OpenAI built the best on-ramp the agent market has had. The traffic jam is a mile further down the road, and they didn't touch it. Neither has anyone else. That's the post I'll be writing next.
The Cutler.sg Newsletter
Weekly notes on AI, engineering leadership, and building in Singapore. No fluff.
Your AI Team Did Nothing While You Slept
Anthropic let Claude run a real shop for a month. It sold metal cubes at a loss, invented a Venmo account, and claimed to wear a blazer. The 'AI department that works while you sleep' is a genre — here's where it actually breaks.
The Productivity J-Curve: Why Your AI Pilot Looks Worst at Week 6
METR ran the experiment. AI made experienced developers 19% slower — and they reported feeling 20% faster. The week-6 dip is the bottom of a documented J-curve. Most pilots get cut here. The right ones don't.
The 5-Step Loop: Why Your Agent Fails at Step 4
ReAct gave us a three-step loop. Production hardened it into five. The two new steps — Plan and Verify — are where everything that goes wrong, goes wrong. And the field has now named the worst offender.