Characterisation Tests Before Agents Touch Brownfield Code
A brownfield refactor I watched go wrong recently: the team pointed an agent at a date-parsing utility that had been quietly working for six years. The function had three subtle, undocumented quirks — a particular treatment of Singapore Standard Time, a weird fallback for two-digit years, and a swallowed exception that downstream code depended on. The agent took one look, declared it "needlessly complex," and rewrote it. The new version was cleaner, well-typed, and broke three production reports on Monday morning.
Nothing about the new code was wrong in the abstract. It just didn't do what the old code did. And there were no tests asserting what the old code did, because nobody had ever written them. There was nothing to fail.
This is the brownfield over-refactor anti-pattern: agents eagerly rewriting stable code without a safety net, producing what Tian Pan accurately called "plausible-looking, syntactically valid, semantically wrong changes" (Tian Pan, April 2026). The fix isn't a smarter agent. It's a 20-year-old technique that finally has the right tool to make it cheap.
What Feathers actually said
Michael Feathers gave us the technique in Working Effectively With Legacy Code (Prentice Hall PTR, 2004, ISBN 978-0131177055). The definition from Chapter 13 is one sentence:
"A characterization test is a test that characterizes the actual behavior of a piece of code."
Not the intended behaviour. Not what's documented. Not what would be correct in a clean implementation. The actual behaviour, including the quirks. You write the test by running the code, observing what it does, and pinning that down. If the quirk is a bug, you've now made it visible — and you can decide to fix it deliberately later. If the quirk is load-bearing (as it usually is in legacy code), you've protected the contract before any change touches it.
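The observe-then-pin loop can be sketched in a few lines. This is a minimal illustration, not code from Feathers's book: `legacyParseYear` is a hypothetical stand-in for a legacy function with a two-digit-year quirk, and `characterise` and `Observation` are names invented for this sketch.

```typescript
// Hypothetical legacy function. The two-digit-year pivot and the silent
// fallback for bad input are the "quirks" we want to pin down.
function legacyParseYear(input: string): number {
  const n = parseInt(input, 10);
  if (Number.isNaN(n)) return 1970; // quirk: unparseable input becomes 1970
  if (n < 100) return n < 50 ? 2000 + n : 1900 + n; // quirk: pivot at 50
  return n;
}

// What we saw when we ran the code: either a return value or a thrown error.
type Observation<I, O> =
  | { input: I; kind: "returned"; value: O }
  | { input: I; kind: "threw"; message: string };

// Run the function on each input and record what actually happened,
// without judging whether the result is "correct".
function characterise<I, O>(fn: (input: I) => O, inputs: I[]): Observation<I, O>[] {
  return inputs.map((input): Observation<I, O> => {
    try {
      return { input, kind: "returned", value: fn(input) };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { input, kind: "threw", message };
    }
  });
}

const observations = characterise(legacyParseYear, ["2024", "49", "50", "abc", ""]);
```

Each recorded observation then becomes one assertion in the characterisation suite. Whether 1970 is a sensible fallback is a separate question, to be answered deliberately later.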
Feathers wrote this for humans staring at unloved code. The discipline applies one-for-one to agents, with one improvement: writing characterisation tests is exactly the kind of mechanical observation work agents are good at.
The agent is the perfect characterisation-test writer
The prompt template I use is short and boring:
Read the function at <path>:<line>. Write a complete set of tests
that capture its current behaviour exactly. For each test:
- Generate inputs that exercise a meaningful branch or edge.
- Run the function and record the actual output verbatim.
- Assert on that exact output — quirks, bugs, and all.
Do NOT correct anything you think looks wrong. The job is
to pin down what the code does today.
Cover: typical inputs, boundary inputs, null/undefined,
empty collections, malformed inputs, and any obvious overflow.
Run it. The agent reads the function, generates 15–30 tests, and runs them to discover the actual outputs. What you get back is a test file that fails immediately if any future change shifts behaviour, including changes the agent itself might make later.
A worked example: a function that maps Singapore postcodes to delivery zones with a fallback for unknown ones. Old code returns "WEST" for unknown postcodes — almost certainly wrong, almost certainly relied upon downstream. The characterisation test the agent produces:
import { mapPostcodeToZone } from "./delivery";

describe("mapPostcodeToZone (characterisation)", () => {
  it("returns CENTRAL for 049315", () =>
    expect(mapPostcodeToZone("049315")).toBe("CENTRAL"));

  it("returns EAST for 460001", () =>
    expect(mapPostcodeToZone("460001")).toBe("EAST"));

  // Quirk: unknown postcodes silently return WEST.
  // This is almost certainly a bug — pin it down anyway.
  it("returns WEST for an unknown postcode", () =>
    expect(mapPostcodeToZone("999999")).toBe("WEST"));

  it("returns WEST for empty string (no validation)", () =>
    expect(mapPostcodeToZone("")).toBe("WEST"));

  it("throws for null input", () =>
    expect(() => mapPostcodeToZone(null as unknown as string)).toThrow());
});
The two WEST tests are the ones that matter. They capture a quirk the original author might not have meant. When a later change sets out to fix the silent WEST fallback, it becomes a deliberate change with a failing test to acknowledge and update, not an invisible behavioural shift.
Lock the agent out until the safety net is in place
The harness layer earns its keep here. Add a permission deny rule to your settings.json for any module the agent isn't allowed to touch:
{
  "permissions": {
    "deny": [
      "Edit(src/legacy/billing/**)",
      "Edit(src/legacy/dates/**)"
    ]
  }
}
This blocks the agent from editing those paths until you explicitly unlock them, typically by generating characterisation tests first, confirming they pass, and then removing the deny rule for the specific module. The discipline becomes mechanical rather than aspirational. The deny rule says no on your behalf at exactly the moment the agent decides to be helpful.
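If you want the unlock step itself to be scripted rather than hand-edited, something like the following could work. It is a sketch under assumptions: `AgentSettings` models only the two fields shown in the settings example above, and `unlockPath` is a hypothetical helper, not part of any harness API. Reading and writing the actual settings.json file is left out.

```typescript
// Assumed shape: only the fields that appear in the settings example.
interface AgentSettings {
  permissions: { deny: string[] };
}

// Return a copy of the settings with exactly one deny rule removed.
// Throws if the rule is absent, so a typo cannot silently unlock nothing.
function unlockPath(settings: AgentSettings, rule: string): AgentSettings {
  if (settings.permissions.deny.indexOf(rule) === -1) {
    throw new Error("No deny rule matches: " + rule);
  }
  return {
    ...settings,
    permissions: {
      ...settings.permissions,
      deny: settings.permissions.deny.filter((r) => r !== rule),
    },
  };
}

const locked: AgentSettings = {
  permissions: {
    deny: ["Edit(src/legacy/billing/**)", "Edit(src/legacy/dates/**)"],
  },
};

// Unlock one module; billing stays denied.
const unlocked = unlockPath(locked, "Edit(src/legacy/dates/**)");
```

Removing one rule at a time keeps the unlock scoped to the module whose tests you have actually verified.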
Coverage honesty: line coverage lies
There's a subtle trap once the characterisation tests are in place: an agent (or human) can produce tests that exercise every line without actually asserting on the behaviour that matters. Line coverage will look great. The safety net will have holes.
Mutation testing is the honest measurement. The tool deliberately mutates the production code — flips a comparison, removes a return, inverts a boolean — and re-runs the test suite. If a mutant survives (no test fails), the suite missed a behavioural assertion. The surviving-mutant count is the real coverage gap.
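A toy illustration of a surviving mutant, written by hand rather than generated by a tool. Everything here is hypothetical: `qualifiesForFreeShipping` stands in for production code, `mutant` is the kind of flipped comparison a mutation tool might produce, and the "suites" are plain arrays of checks rather than a real test runner.

```typescript
type Predicate = (amount: number) => boolean;

// Hypothetical production code under test.
const qualifiesForFreeShipping: Predicate = (amount) => amount >= 50;

// A typical mutant: the comparison is flipped from >= to >.
const mutant: Predicate = (amount) => amount > 50;

// A suite is a list of checks; it passes only if every check holds.
type Suite = Array<(fn: Predicate) => boolean>;

// Executes every line of the function, but never probes the boundary.
const weakSuite: Suite = [
  (fn) => fn(100) === true,
  (fn) => fn(10) === false,
];

// Same checks plus one assertion at the boundary.
const strongSuite: Suite = [
  ...weakSuite,
  (fn) => fn(50) === true,
];

const passes = (suite: Suite, fn: Predicate): boolean =>
  suite.every((check) => check(fn));

// A mutant survives when the suite still passes against it.
const survivesWeak = passes(weakSuite, mutant); // true: the gap is invisible to line coverage
const survivesStrong = passes(strongSuite, mutant); // false: the boundary check kills it
```

Both suites give the one-line function full line coverage and both pass against the original. Only the second notices when the behaviour at exactly 50 changes.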
The toolchain depends on your stack:
- JavaScript / TypeScript / C# / Scala — Stryker, current release v9.0.1 (May 2025). Tagline: "Test your tests with mutation testing." Note: contrary to what some references suggest, Stryker does not support Java.
- Python — Cosmic Ray, active beta from sixty-north. Mature enough for production use on individual modules.
- Industrial scale — Meta's Automated Compliance Hardening (ACH), published September 2025 by Mark Harman, uses LLMs to generate compliance-targeted mutants and matched tests. In an October–December 2024 trial across Facebook, Instagram, WhatsApp, and Meta wearables, "73% of the generated tests were accepted by engineers." That number is the upper bound for what's realistic when LLMs author the mutation suite.
For a brownfield refactor in agent hands, the playbook is:
1. Generate characterisation tests with the agent.
2. Run mutation testing on the new tests. Aim for a mutation score above 80% on the module being refactored.
3. Only after step 2 passes, remove the Edit(...) deny rule for that path.
4. Let the agent refactor. The tests now hold the contract.
Steps 1 and 2 cost a few minutes of agent time on a typical small module. Step 4 is the part that used to be terrifying. The first two steps make it boring.
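The gate between mutation testing and unlocking can be expressed as a small check. This sketch assumes a simplified score, killed mutants divided by killed plus survived. Real tools also track timeouts, mutants with no coverage, and compile errors, so treat `MutationRun`, `mutationScore`, and `mayUnlock` as illustrative names rather than any tool's API.

```typescript
interface MutationRun {
  killed: number; // mutants that made at least one test fail
  survived: number; // mutants that no test noticed
}

// Simplified mutation score as a percentage of mutants killed.
function mutationScore(run: MutationRun): number {
  const total = run.killed + run.survived;
  if (total === 0) return 0; // no mutants generated: treat as unproven, not as perfect
  return (run.killed * 100) / total;
}

// The playbook asks for a score above the threshold, so the comparison is strict.
function mayUnlock(run: MutationRun, threshold: number = 80): boolean {
  return mutationScore(run) > threshold;
}

const goodRun: MutationRun = { killed: 42, survived: 8 }; // score 84
const weakRun: MutationRun = { killed: 30, survived: 10 }; // score 75
```

With these numbers `mayUnlock(goodRun)` is true and `mayUnlock(weakRun)` is false. A run that generates no mutants at all also keeps the module locked, which is the conservative choice.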
What this is really doing
The brownfield over-refactor anti-pattern is what happens when an agent is allowed to be opinionated about code that has no defended contract. The fix isn't to make the agent more cautious — well-tuned agents are still over-eager, that's their job. The fix is to defend the contract first, then let the agent be as opinionated as it likes within the boundary.
Feathers wrote Working Effectively With Legacy Code before AI agents existed. Twenty years later it remains the cheapest piece of legacy-code discipline you can adopt, and for the first time we have a tool that makes writing the characterisation tests fast enough that there is no excuse not to.
Don't let an agent touch code whose behaviour you can't roll back to. Generate the safety net first. Mutation-test it to make sure the net has no holes. Then refactor with both hands free.