From Prompt Engineering to Context Engineering: The Skill Didn't Die. It Got Harder.
The $335k That Broke Twitter
In March 2023, a single posting on the Anthropic careers page broke the internet. "Prompt Engineer and Librarian." Salary band: $175,000 to $335,000. By April, "prompt engineer" hit its all-time peak on Indeed, and over three months LinkedIn searches for the term spiked 7,000 percent.
Two years later, LinkedIn profiles listing "prompt engineer" had cratered. One bootcamp graduate sent 200 applications, landed three interviews, zero offers. The Wall Street Journal ran the piece as "Talking to Chatbots Is Now a $200K Job." Sam Altman had predicted it the year before: "I don't think we'll be doing prompt engineering in five years." He was wrong about the timeline. The job title was dead in two.
But the skill didn't die. It migrated.
A Discipline Named for Its Most Visible Technique
The job title named the most visible ten percent of the work — the words you typed. The other ninety percent was invisible: the examples, the structure, the retrieval, everything the model saw before it generated a token.
The papers showed the direction from the start. Chain-of-Thought Prompting, Google Brain, January 2022, was about what gets shown to the model, not how to phrase the request: eight exemplars of step-by-step reasoning turned a 540-billion-parameter model into state-of-the-art on GSM8K. ReAct, Princeton and Google, October 2022 — the first widely-cited agent pattern — interleaved reasoning traces with tool outputs. That's not a prompt. That's an information architecture. By May 2023, Tree of Thoughts took GPT-4 on the Game of 24 from four percent to seventy-four percent by structuring the search space, not finding magic words. Even Anthropic's own 2023 guides recommended XML tags, scratchpads, and placing instructions at the end of long prompts — context architecture dressed as prompt advice.
Simon Willison defended the discipline in February 2023: "The best prompt engineers are meticulous," he wrote, "they iterate on their prompts and try to figure out exactly which components are necessary." By June 2025 he had flipped, conceding it had become "a laughably pretentious term for typing things into a chatbot." The name was always too small; the surface that mattered just moved deeper into the system.
The Frankenstein Prompt Ceiling
Hamel Husain has consulted on production LLM systems since GPT-3. In the Applied LLMs guide he co-authored in June 2024, he documented Rechat's AI assistant, Lucy: rapid early progress from prompting, then a plateau, then a hard ceiling. Fix one failure mode, another pops up. His quote is the gravestone of the old paradigm: "Our initially simple prompt is now a 2,000 token Frankenstein. And to add injury to insult, it has worse performance on the more common and straightforward inputs."
That's an architectural problem, and the research now explains it cleanly. Lost-in-the-Middle, Liu et al., Stanford, TACL 2024: in a 20-document QA task, accuracy drops more than 30 percent when the key fact sits in positions 5–15 versus 1 or 20 — a U-shaped curve, consistent across every model tested. An instruction buried mid-context isn't being weighed; it's being quietly ignored, because that's how transformer attention distributes itself. Chroma's 2025 Context Rot study pushed this further: eighteen frontier models — GPT-4.1, Claude Opus 4, Gemini 2.5 Pro, Qwen3 — all degraded as the window filled. Not some. All. A 200K-window model can degrade significantly well before 50K.
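One common mitigation for that U-shaped curve is to reorder retrieved material so the strongest items sit at the edges of the window and the weakest land in the middle. The sketch below is a minimal illustration, not a method from the paper: the function name `reorder_for_edges` is hypothetical, and it assumes the input is already ranked most-relevant first.

```python
from typing import List


def reorder_for_edges(ranked: List[str]) -> List[str]:
    """Place the most relevant items at the start and end of the list.

    `ranked` is assumed to be sorted most-relevant first. Items are dealt
    alternately to the front and the back, so the least relevant ones end
    up in the middle, where recall is weakest.
    """
    front: List[str] = []
    back: List[str] = []
    for i, doc in enumerate(ranked):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    # Reverse the back half so the second-best item is the very last one.
    return front + back[::-1]


docs = ["d1", "d2", "d3", "d4", "d5"]  # d1 = most relevant (hypothetical data)
print(reorder_for_edges(docs))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

The best document opens the window, the second best closes it, and the weakest one absorbs the mid-context penalty.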
Then errors compound. Chain-of-thought is beautiful for single inferences and brutal for agent loops. At one percent error per step, an agent's first failure arrives after an expected hundred steps, but the odds of a clean run fall much faster than that average suggests: assuming independent steps, only about two in three forty-step runs finish without an error (0.99^40 ≈ 0.67), and real coding agents routinely run forty turns or more. Cognition found coding agents spending over 60 percent of their first turn just retrieving context. The longer the run, the more compounding error turns a linear cost curve into a cliff.
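The arithmetic behind the compounding claim is easy to check. This is a back-of-envelope sketch under two loud assumptions: every step fails independently at the same rate, and any single failure sinks the run. Real agents recover from some errors, so treat these numbers as a pessimistic bound rather than a measurement.

```python
def survival_probability(error_rate: float, steps: int) -> float:
    """Chance that `steps` independent steps all succeed."""
    return (1.0 - error_rate) ** steps


def expected_steps_to_first_failure(error_rate: float) -> float:
    """Mean of a geometric distribution with failure probability p: 1 / p."""
    return 1.0 / error_rate


print(expected_steps_to_first_failure(0.01))  # 100.0
for n in (10, 40, 100):
    # Prints 0.904, 0.669 and 0.366 for 10, 40 and 100 steps.
    print(n, round(survival_probability(0.01, n), 3))
```

The average survival time is a hundred steps, yet a forty-turn run already fails about one time in three, and a hundred-turn run fails almost two times in three.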
The deepest break: prompts were designed for a clean slate; agents inherit everything. Identical prompt, different context, catastrophically different behaviour. The prompt was never really the variable. Stanford shipped DSPy in October 2023 on exactly that premise — "existing LM pipelines are typically implemented using hard-coded prompt templates, i.e. lengthy strings discovered via trial and error" — and automated the whole thing. The implicit argument is almost unkind: if prompting were tractable for complex pipelines, you wouldn't need to automate it.
The Naming Moment — June 2025
The field needed a better name, and across two weeks of June 2025 it got one.
June 19. Tobi Lütke, CEO of Shopify, posts the sentence that starts the avalanche, saying of the term "context engineering": "It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."
June 23. Harrison Chase publishes "The rise of context engineering", and Lance Martin drops the four-practice taxonomy.
June 25. Andrej Karpathy amplifies: "context engineering is the delicate art and science of filling the context window with just the right information for the next step... Doing this well is highly non-trivial."
June 27. Simon Willison endorses the term.
June 30. Philipp Schmid at Google DeepMind publishes the definition practitioners now quote: "Context Engineering is the discipline of designing and building dynamic systems that provides the right information and tools, in the right format, at the right time." He adds the diagnostic line that reframes every agent post-mortem: "Most agent failures are not model failures anymore, they are context failures."
Eleven days. Four definitions. One discipline.
By July, Drew Breunig measured the trend: "in a month, it's over a quarter of the search volume for 'prompt engineering'" — and climbing. Anthropic formalised it on September 29, 2025 with a full engineering post on "attention budget" and "context rot" as first-class concepts. By January 2026, Chase was on Sequoia's Training Data podcast almost laughing at the fit: "Context engineering is such a good term. I wish I came up with that term. It actually really describes everything we've done at LangChain without knowing that that term existed."
What Context Engineering Actually Is
Lance Martin's taxonomy is the cleanest mental model. Write context — scratchpads, memory tools, NOTES.md files. Select context — RAG, memory retrieval, tool and few-shot selection. Compress context — summarisation, trimming, pruning. Isolate context — sub-agents with focused windows, sandboxes, state-schema boundaries. Karpathy's analogy sits underneath all of it: the LLM is the CPU, the context window is the RAM, and context engineering is the operating system that decides what gets loaded before each instruction runs. Anthropic's engineering post makes this concrete — compaction, structured note-taking (demonstrated, hilariously, by Claude playing Pokémon and maintaining precise tallies across thousands of game steps), and sub-agents that burn tens of thousands of tokens each and return 1,000–2,000-token distillations to a parent.
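To make the four practices concrete, here is a toy sketch of one object that does all four. Everything in it is an assumption for illustration: the `ContextManager` class and its method names are hypothetical, `count_tokens` is a crude word count standing in for a real tokenizer, and selection uses naive word overlap where a production system would use embeddings or a retriever.

```python
from dataclasses import dataclass, field
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


@dataclass
class ContextManager:
    budget: int  # maximum tokens allowed into the window
    notes: List[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        """Write: persist a fact outside the window, like a scratchpad."""
        self.notes.append(note)

    def select(self, query: str, k: int = 2) -> List[str]:
        """Select: return the k notes sharing the most words with the query."""
        q = set(query.lower().split())
        ranked = sorted(
            self.notes,
            key=lambda n: len(q & set(n.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def compress(self, items: List[str]) -> List[str]:
        """Compress: keep items in order until the token budget is spent."""
        kept: List[str] = []
        used = 0
        for item in items:
            cost = count_tokens(item)
            if used + cost > self.budget:
                break
            kept.append(item)
            used += cost
        return kept

    def isolate(self, task: str) -> "ContextManager":
        """Isolate: give a sub-task a fresh manager that sees only the task."""
        child = ContextManager(budget=self.budget)
        child.write(task)
        return child


ctx = ContextManager(budget=12)
ctx.write("deploy target is staging cluster")
ctx.write("user prefers short answers")
ctx.write("staging deploy failed on missing env var")
window = ctx.compress(ctx.select("why did the staging deploy fail"))
print(window)  # the two deploy-related notes; the style preference is left out
```

The design point is that the model never sees `notes` directly. It sees only what `select` and `compress` let through, and a sub-agent created by `isolate` starts with none of the parent's accumulated history.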
Philipp Schmid's example is the clearest "before and after" I've seen. Same LLM, same task: "Can you do a quick meeting tomorrow?" The prompt-only agent replies "Thank you for your message. Tomorrow works for me. May I ask what time?" — serviceable, robotic, useless if you have a calendar. The context-engineered agent has already pulled in the calendar, the prior email thread, the contact's history, and the tools, and replies: "Hey Jim! Tomorrow's packed on my end, back-to-back all day. Thursday AM free if that works for you? Sent an invite, lmk if it works." The model didn't change. The magic was everything that happened before it was called. As Dex Horthy, whose 12-factor agents repo has quietly accumulated 19,400 stars, puts it: "LLMs are stateless functions that turn inputs into outputs. To get the best outputs, you need to give them the best inputs."
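Horthy's "stateless function" framing translates almost directly into code. The sketch below is an illustration, not Schmid's or Horthy's implementation: `build_context` and `answer` are hypothetical names, the calendar and thread contents are made-up sample data, and the model is passed in as a plain callable so the example runs without any API. The XML-style tags follow the general tagging advice mentioned earlier.

```python
from typing import Callable, Dict, List


def build_context(message: str, sources: Dict[str, Callable[[], str]]) -> str:
    """Assemble everything the model should see before it is called."""
    sections: List[str] = []
    for name, fetch in sources.items():
        # Each source is fetched at call time, so the window is never stale.
        sections.append(f"<{name}>\n{fetch()}\n</{name}>")
    # The user's message goes last, after the supporting material.
    sections.append(f"<message>\n{message}\n</message>")
    return "\n".join(sections)


def answer(
    message: str,
    sources: Dict[str, Callable[[], str]],
    llm: Callable[[str], str],
) -> str:
    """The model is a stateless function: output depends only on the input."""
    return llm(build_context(message, sources))


# Hypothetical sample data standing in for real calendar and email lookups.
sources = {
    "calendar": lambda: "Tomorrow: booked 09:00-18:00. Thursday: free until 12:00.",
    "thread": lambda: "Jim asked last week about a quick sync.",
}
prompt = build_context("Can you do a quick meeting tomorrow?", sources)
print(prompt)
```

With an empty `sources` dict this collapses to the prompt-only agent. The difference between the two replies in Schmid's example lives entirely in what `build_context` returns.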
This matches what I see running four or five Claude Code agents across my own projects. When one produces something useless, I almost never trace it to a badly worded instruction. I trace it to what was in the window when it answered — a stale file it read forty steps ago, a half-finished tool result, the original task buried under a wall of accumulated noise. The phrasing was fine. The context had rotted. I spend far more of my time now deciding what an agent should see than what I should say to it.
The Career S-Curve
What the market did with all this is a three-act play.
Act 1, April 2023. Anthropic's posting. Indeed searches peak. Over 250,000 U.S. LinkedIn postings mention "prompt engineer." DeepLearning.AI launches ChatGPT Prompt Engineering for Developers in May and gets 300,000 signups in under a week. Top salary: $335,000. Hype hardening into a job title.
Act 2, June 30, 2023. While the hype still peaks, Shawn "swyx" Wang publishes "The Rise of the AI Engineer": "to wield them, we'll have to go beyond the Prompt Engineer and write software." Within 24 hours, 1,000-plus pre-register for the first AI Engineer Summit; it sells out at 500 seats that October, and by June 2024 the World's Fair hosts 3,000-plus engineers. The title migrates while the obituary writers aren't watching. swyx's line ages beautifully: "Prompt Engineering was both overhyped and here to stay."
Act 3, 2025–2026. Fortune, Fast Company and the WSJ run the obituary wave in May 2025. Meanwhile LinkedIn names "AI Engineer" the #1 fastest-growing U.S. job title for the second year running, postings up 143 percent year-over-year, even as "prompt engineer" profiles fall 40 percent from mid-2024 to early 2025. PwC's 2025 Global AI Jobs Barometer finds AI-skilled workers command a 56 percent wage premium, up from 25 percent the year before. The premium more than doubled.
The financial proof is almost embarrassing. Cursor — fundamentally a context engineering system, auto-injecting codebase structure and recent changes before each call — went from $500 million ARR in June 2025 to $1 billion in November to $2 billion by February 2026. The windows beneath all this went from 4K tokens at ChatGPT's launch to 200K with Claude 2.1 to 1 million with Gemini 1.5 Pro to ten million with Llama 4 Scout. Anthropic's Model Context Protocol, launched November 2024, already has 5,000–10,000 community servers. Gartner expects 40 percent of enterprise applications to include task-specific AI agents by end of 2026, up from under 5 percent in 2025.
Every one of those agents is context engineering in production, whether or not the people building them call it that — and at production scale, the next obstacle is rarely the model. It's the governance wall.
Where to Invest Now
What to study. Five resources that repay your time more than any prompt engineering course did. Lance Martin's four-practice taxonomy — Write, Select, Compress, Isolate — for the mental model. Anthropic's "Effective context engineering for AI agents" for the formal manual, attention budget and context rot included. Dex Horthy's 12-factor agents for owning your context window rather than inheriting a framework's. Drew Breunig's failure taxonomy — Poisoning, Distraction, Confusion, Clash — for a debugging vocabulary. And Hamel Husain on evals, because context engineering without measurement is cargo cult with extra steps.
The posture shift. Stop obsessing over phrasing; obsess over architecture. Stop writing monster system prompts; build systems that assemble clean context at runtime. Stop asking "what should I tell the model?" and start asking "what should the model see?" The answer is almost always less than you think — Anthropic's own guidance is to find the smallest possible set of high-signal tokens that still does the job. At team scale this becomes a progression you can map, which I've laid out as a five-stage maturity model for AI engineering teams.
The honest complication. Gary Marcus would point out, not unreasonably, that both prompt and context engineering are elaborate workarounds for systems that don't reliably understand what they're doing — "patches of competence separated by regions of incompetence." He's not wrong. The engineering discipline is legitimate, well-paid, increasingly necessary; the underlying reliability problem is also real and unsolved. Both are true at once. You can build a career in the gap.
Ethan Mollick put it better than I can, in January 2026: "The skills that are so often dismissed as 'soft' turned out to be the hard ones." Knowing what good looks like. Explaining it clearly enough that even an AI can deliver it. Curating what your systems see. Measuring whether they did the thing. None of that fits on a resume line that says "prompt engineer." All of it is the work — and it only gets more load-bearing as the models get good enough that you stop checking.