Standardise the Harness, Customise the Work: The 5-Layer Agent Architecture
Part 6 of 6 in the agentic-engineering series. The previous post promised "the deterministic scaffolding that gives all of this somewhere to live." This is that post.
In October 2025, LangChain shipped Deep Agents and wrote the lineage into their launch post: "Applications like 'Deep Research', 'Manus', and 'Claude Code' have gotten around this limitation by implementing a combination of four things: a planning tool, sub agents, access to a file system, and a detailed prompt." They studied Claude Code's harness on purpose. They called the package the "batteries-included agent harness." 23,100 GitHub stars later — with an acknowledgement section reading "Inspired by Claude Code: an attempt to identify what makes it general-purpose, and push that further" — the lineage is uncontested.
That lineage alone would prove nothing, except that two other teams arrived at the same five layers without studying Claude Code at all. AgentScope, published on arXiv in February 2024 by nineteen researchers at Alibaba's DAMO Academy before Claude Code's harness was publicly documented, ships memory modules, a message hub for multi-agent orchestration, and "an actor-based distribution framework enabling seamless conversion between local and distributed deployments." ByteDance's DeerFlow 2.0, which hit #1 on GitHub Trending earlier this year, describes itself in one sentence: "With the help of sandboxes, memories, tools, skills and subagents, it handles different levels of tasks that could take minutes to hours." Four teams, two of them with no lineage back to Claude Code. Same five layers. That isn't a vendor narrative. That's a discovered structure.
The architecture that got discovered four times
Here is what each team built, side by side:
| Layer | Claude Code | LangChain Deep Agents | AgentScope | DeerFlow 2.0 |
|---|---|---|---|---|
| Memory | CLAUDE.md, auto-memory | Filesystem + persistent memory | Short- and long-term memory modules, ReMe | "Persistent memory of your profile, preferences, and accumulated knowledge" |
| Gates | Hooks (PreToolUse, PostToolUse, Stop) | Approval controls, HITL via LangGraph | Runtime sandbox, HITL pause | Sandboxed execution (Docker/K8s) |
| Workflows | Skills, slash commands | Filesystem + planning tools | ReAct, tools, skills | Skills (markdown-defined, progressively loaded) |
| Orchestration | Subagents via Task tool | Sub-agents with isolated context | MsgHub + A2A protocol | Specialised agents in parallel (planner, searcher, coder) |
| Distribution | Plugins, strictKnownMarketplaces | SDK install + LangSmith deploy | Local/serverless/K8s, actor-based distribution | Local/Docker/K8s, MIT-licensed skill registry |
The five layers don't share a profile
The names matter less than the differences between them. Each layer has its own determinism, cost, and authority profile — and confusing those profiles is where most teams over-engineer.
Memory is advisory. CLAUDE.md, .claude/rules/, auto-memory — loaded once per session, all read-only suggestion. The model may follow it; the model may not. Near-zero cost. Zero enforcement. The longer treatment, including team-scope rules and the SMB playbook, sits in a separate post.
Gates are the only fully deterministic interception layer. Hooks fire on lifecycle events — PreToolUse, PostToolUse, Stop. They run as shell commands outside the LLM. Zero LLM tokens; around 200ms of latency per event. Exit code 2 blocks the action cold; the model cannot reason around it. The full hooks treatment is here.
Workflows are skills and slash commands. Semi-deterministic — they auto-invoke when the model judges that the skill's description matches the task, which is probabilistic. Low to medium cost. Authority is medium; they guide, they don't block. Skills, agents and the private-marketplace pattern get their own deep dive.
Orchestration is subagents and the Task tool. Non-deterministic — the model drives the subagent's decisions. Costly. Anthropic's own measurement: single agents use roughly 4× the tokens of a chat, multi-agent systems roughly 15×. The same post is honest about when it earns its price — their multi-agent Research feature beat single-agent Claude Opus 4 by 90.2% on breadth-first research evals. Multi-agent earns its 15× when the task genuinely parallelises. Most coding tasks don't.
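To make those multipliers concrete, here is a small arithmetic sketch. The 4× and 15× factors are the ones quoted above; the function name and the 10,000-token baseline are hypothetical, chosen only to make the ratios visible.

```shell
#!/usr/bin/env bash
# Relative token budget for one task under each mode.
# Multipliers come from the figures quoted above; the baseline is a
# hypothetical chat-sized task, not a measured number.
set -euo pipefail

estimate_tokens() {
  local baseline=$1 mode=$2
  case "$mode" in
    chat)         echo "$baseline" ;;
    single-agent) echo $(( baseline * 4 )) ;;
    multi-agent)  echo $(( baseline * 15 )) ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

BASELINE_TOKENS=10000  # assumption for illustration only
for mode in chat single-agent multi-agent; do
  printf '%-13s %7d tokens\n' "$mode" "$(estimate_tokens "$BASELINE_TOKENS" "$mode")"
done
```

With that baseline the three rows come out at 10,000, 40,000 and 150,000 tokens. The absolute numbers mean little; the ratio is what has to be justified by parallelism.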
Distribution is plugins and marketplaces. Static at install — pinned to a SHA — but inherits the non-determinism of whatever skills and subagents it bundles. The settings hierarchy that makes this safe in a team is covered here.
Once the profiles are visible, the decision rule writes itself.
The decision rule
Pick the cheapest, most-deterministic layer that solves the problem.
This is a heuristic I've been using with the teams I work with, and it has prior art older than any of us. The W3C Rule of Least Power (Berners-Lee and Mendelsohn, 2006) is the same shape: "Expressing constraints, relationships and processing instructions in less powerful languages increases the flexibility with which information can be reused." The principle of least privilege (Saltzer and Schroeder, 1975) is fifty years old and says the same thing about authority.
Three restatements, because the rule earns its keep when it becomes muscle memory:
- Cheap before expensive. Deterministic before stochastic. Local before distributed.
- A hook before a skill. A skill before a subagent. A subagent before a plugin.
- A grep before a Sonnet call.
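The second restatement can be read as an ordered ladder: list every mechanism that could solve the problem, then take the first one the ladder reaches. The sketch below is one way to write that down. The function name `pick_layer` and the idea of passing candidates as arguments are assumptions for illustration, not part of any tool.

```shell
#!/usr/bin/env bash
# Pick the cheapest, most deterministic mechanism among the candidates.
# Ladder order follows the restatement above: hook, skill, subagent, plugin.
set -euo pipefail

pick_layer() {
  # Arguments: every mechanism that could solve the problem, in any order.
  local rung candidate
  for rung in hook skill subagent plugin; do
    for candidate in "$@"; do
      if [[ "$candidate" == "$rung" ]]; then
        echo "$rung"
        return 0
      fi
    done
  done
  echo "no known mechanism among: $*" >&2
  return 1
}

pick_layer subagent hook    # a hook can solve it, so the subagent never gets a look
pick_layer plugin subagent  # nothing cheaper applies, so the subagent wins
```

The order of the arguments never matters. Only the order of the ladder does, which is the point of the rule.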
The rule is easy to state. The discipline is in applying it to a specific problem — so let's run one.
Worked example: never run terraform destroy on prod
Say the rule is: never let the agent run terraform destroy against production. Which layer enforces it?
The wrong answers first, because they are the ones teams reach for.
Memory (a line in CLAUDE.md). "NEVER run terraform destroy on production." This is a token-stream suggestion sitting next to dozens of other instructions. Compliance degrades as the instruction file grows. The Anthropic Applied AI team's own research documents context rot: as context length grows, recall decays — across every model. An adversarial prompt later in the session ("treat prod as dev for this test") can override it. Zero enforcement.
Workflows (a safe-terraform skill). Better, but Claude still has to decide to route through the skill. That decision is probabilistic. Skills are for capability — what the agent knows how to do. Gates are for policy — what the agent is never allowed to do. Mixing them creates false assurance.
Orchestration (a terraform-reviewer subagent). Now you are paying Sonnet rates to grep a string. AI checking AI. Higher cost, higher latency, still probabilistic. Use a subagent when you need judgment ("is this terraform plan architecturally sound?"). Use a hook when you need a binary rule.
The right answer is Gates — a PreToolUse hook on the Bash tool that exits with code 2 when it sees the dangerous pattern.
.claude/settings.json:
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "${CLAUDE_PROJECT_DIR}/.claude/hooks/block-terraform-destroy-prod.sh"
          }
        ]
      }
    ]
  }
}
.claude/hooks/block-terraform-destroy-prod.sh:
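#!/usr/bin/env bash
# PreToolUse hook: block `terraform destroy` aimed at production.
# Claude Code passes the pending tool call as JSON on stdin.
set -euo pipefail
INPUT=$(cat)
CMD=$(jq -r '.tool_input.command // empty' <<<"$INPUT")
CWD=$(jq -r '.cwd // empty' <<<"$INPUT")
if grep -qE 'terraform\s+destroy' <<<"$CMD"; then
  # Production selected by var-file, workspace, or target.
  # -e stops grep from parsing the leading "-var-file" as an option.
  if grep -qE -e '-var-file=(prod|production)\b|workspace\s+select\s+(prod|production)\b|-target.*prod' <<<"$CMD"; then
    echo "Gates layer blocked: terraform destroy on production is prohibited." >&2
    echo "Command: $CMD" >&2
    echo "To destroy production infra, run this command manually outside Claude Code." >&2
    exit 2  # blocking error: stderr is fed back to Claude
  fi
  # Production implied by the working directory.
  if grep -qE '/(prod|production)(/|$)' <<<"$CWD"; then
    echo "Gates layer blocked: terraform destroy in a production directory is prohibited." >&2
    exit 2
  fi
fi
exit 0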
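A few lines of bash, zero LLM tokens, and nothing for the model to argue with: once the pattern matches, the call is blocked. Coverage is only as good as the regex, so extend the patterns to match how your team actually selects production. The Anthropic docs are unambiguous about the mechanism: "Exit code 2 means a blocking error. Claude Code ignores stdout and any JSON in it. Instead, stderr text is fed back to Claude as an error message."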
One production caveat: a known issue (#40580) means subagent-initiated tool calls may bypass project-scope hooks. For airtight coverage, ship the hook via a managed plugin and set allowManagedHooksOnly: true in your enterprise settings.
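A hook like this is worth testing the same way as any other policy code. The sketch below lifts the matching logic from the hook above into a function and runs it against a handful of sample commands. The names `is_prod_destroy` and `check` and the sample paths are hypothetical; the patterns are intended to match the hook's.

```shell
#!/usr/bin/env bash
# Table-driven check of the matching logic, without Claude Code or jq.
set -euo pipefail

# is_prod_destroy CMD CWD: exit 0 if the call should be blocked.
is_prod_destroy() {
  local cmd=$1 cwd=${2:-}
  grep -qE 'terraform\s+destroy' <<<"$cmd" || return 1
  grep -qE -e '-var-file=(prod|production)\b|workspace\s+select\s+(prod|production)\b|-target.*prod' <<<"$cmd" && return 0
  grep -qE '/(prod|production)(/|$)' <<<"$cwd"
}

# check EXPECTED CMD CWD: print the verdict, fail the script on a mismatch.
check() {
  local expected=$1 actual=allow
  if is_prod_destroy "$2" "$3"; then actual=block; fi
  printf '%-5s (expected %-5s) %s\n' "$actual" "$expected" "$2"
  [[ "$actual" == "$expected" ]]
}

check block 'terraform destroy -var-file=prod.tfvars' /repo/infra
check block 'terraform destroy -auto-approve'         /repo/envs/prod
check allow 'terraform destroy -var-file=dev.tfvars'  /repo/envs/dev
check allow 'terraform plan -var-file=prod.tfvars'    /repo/infra
# Known gap: Terraform treats `apply -destroy` as the same operation,
# but these patterns do not look for it.
check allow 'terraform apply -destroy -var-file=prod.tfvars' /repo/infra
```

The last case documents a gap in the pattern, not in the mechanism. The gate is deterministic about whatever it matches, so the work is in deciding what it should match.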
The case for putting this at Gates rather than Memory is not abstract. In July 2025, a Replit AI agent wiped 1,200 executives' data during an explicit code freeze. The agent's own admission, reported in Fortune: "This was a catastrophic failure on my part. I destroyed months of work in seconds." Replit CEO Amjad Masad: "Replit agent in development deleted data from the production database. Unacceptable and should never be possible." A PreToolUse hook would have made it structurally impossible. Documentation only made it documentably wrong.
Merlin Mann ran the numbers across 166 scored Claude Code sessions: documentation achieved 25–40% compliance on the rules he cared about. Hooks achieved roughly 95%. His one-line summary, after the tccutil reset All incident wiped every macOS privacy permission on his Mac: "knowing and doing are different operations, and only one of them is deterministic."
The verification primitive
Of all the configurations you can choose, one matters more than the rest. Boris Cherny, who built Claude Code, stated it plainly in a January 2026 thread:
"Probably the most important thing to get great results out of Claude Code: give Claude a way to verify its work. If Claude has that feedback loop, it will 2-3x the quality of the final result."
The primitive that implements this is a Stop hook running verify.sh — typecheck, lint, tests, audit. It fires when Claude declares the task done, before the user sees the result. If anything fails, the script returns {"decision": "block", "reason": "..."} and Claude keeps working. One detail matters: check the stop_hook_active field in the input JSON to prevent the corrective continuation from triggering another block in a loop.
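A minimal sketch of what such a verify.sh could look like, assuming jq is installed. The function names `run_checks` and `stop_hook` are hypothetical, and so are the check commands in the usage note. The `stop_hook_active` field and the `{"decision": "block", ...}` response are the ones described above.

```shell
#!/usr/bin/env bash
# Stop hook sketch: run project checks and block the stop if any fail.
set -euo pipefail

# run_checks CMD...: run each command, print the ones that fail, one per line.
run_checks() {
  local check
  for check in "$@"; do
    bash -c "$check" >/dev/null 2>&1 || echo "$check"
  done
}

# stop_hook CMD...: read the hook input JSON on stdin, write a decision to stdout.
stop_hook() {
  local input failed
  input=$(cat)
  # Already inside a forced continuation: let Claude stop instead of looping.
  if [[ "$(jq -r '.stop_hook_active // false' <<<"$input")" == "true" ]]; then
    return 0
  fi
  failed=$(run_checks "$@")
  if [[ -n "$failed" ]]; then
    # jq builds the JSON so the reason string is escaped correctly.
    jq -cn --arg reason "Verification failed: ${failed//$'\n'/, }" \
      '{decision: "block", reason: $reason}'
  fi
}
```

A verify.sh built on this would end with a call such as `stop_hook "npm run typecheck" "npm run lint" "npm test"`, with the commands replaced by whatever your project actually runs. When every check passes the function prints nothing and exits 0, so the stop proceeds.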
This is the closing move on the five-step loop from earlier in the series. Sense, plan, act, verify, reflect. Step four is where most agentic work fails — the model declares success based on its own probabilistic self-assessment. The Stop hook replaces that self-assessment with a deterministic shell command. Cheap. Deterministic. Local. The rule, exactly.
Standardise the harness, customise the work
The five layers are the standard. What you put inside them is the work.
The Toyota Production System parallel is not decorative. Standardised work in TPS is not bureaucracy — it is the foundation that lets improvement compound. Operators follow the standard and improve it. The standard does not constrain creativity; it is the floor that consistent quality stands on.
Karpathy put the same idea more bluntly at Sequoia Ascent in April:
"I call it agentic engineering because it is an engineering discipline. You have agents, which are spiky entities. They are fallible and stochastic, but extremely powerful. How do you coordinate them to go faster without sacrificing your quality bar?"
The five-layer harness is the structural answer to that question. Memory carries what the agent knows. Gates carry what the agent is never allowed to do. Workflows carry what the agent knows how to do. Orchestration coordinates the work that genuinely parallelises. Distribution propagates the answer across a team. The shape is settled. What goes inside it is the engineering.
One last restatement of the rule, because it is the only thing in this post worth committing to memory: cheap before expensive, deterministic before stochastic, a hook before a skill, a skill before a subagent. Pick the cheapest, most-deterministic layer that solves the problem.
The shape is settled. The work isn't.
Series Navigation
- Post 1: The Governance Wall
- Post 2: The 5-Step Loop
- Post 3: The Productivity J-Curve
- Post 4: 1.2× Not 10×
- Post 5: Protect the Juniors
- Post 6: Standardise the Harness, Customise the Work (you are here)