Characterisation Tests for Brownfield AI Refactors

A brownfield refactor I watched go wrong recently: the team pointed an agent at a date-parsing utility that had been quietly working for six years. The function had three subtle, undocumented quirks — a particular treatment of Singapore Standard Time, a weird fallback for two-digit years, and a swallowed exception that downstream code depended on. The agent took one look, declared it "needlessly complex," and rewrote it. The new version was cleaner, well-typed, and broke three production reports on Monday morning.

Nothing about the new code was wrong in the abstract. It just didn't do what the old code did. And there were no tests asserting what the old code did, because nobody had ever written them. There was nothing to fail.

This is the brownfield over-refactor anti-pattern: agents eagerly rewriting stable code without a safety net, producing what Tian Pan accurately called "plausible-looking, syntactically valid, semantically wrong changes" (Tian Pan, April 2026). The fix isn't a smarter agent. It's a 20-year-old technique that finally has the right tool to make it cheap.

What Feathers actually said

Michael Feathers gave us the technique in Working Effectively With Legacy Code (Prentice Hall PTR, 2004, ISBN 978-0131177055). The definition from Chapter 13 is one sentence:

"A characterization test is a test that characterizes the actual behavior of a piece of code."

Not the intended behaviour. Not what's documented. Not what would be correct in a clean implementation. The actual behaviour, including the quirks. You write the test by running the code, observing what it does, and pinning that down. If the quirk is a bug, you've now made it visible — and you can decide to fix it deliberately later. If the quirk is load-bearing (as it usually is in legacy code), you've protected the contract before any change touches it.

Feathers wrote this for humans staring at unloved code. The discipline applies one-for-one to agents, with one improvement: writing characterisation tests is exactly the kind of mechanical observation work agents are good at.

The agent is the perfect characterisation-test writer

The prompt template I use is short and boring:

Read the function at <path>:<line>. Write a complete set of tests
that capture its current behaviour exactly. For each test:
 - Generate inputs that exercise a meaningful branch or edge.
 - Run the function and record the actual output verbatim.
 - Assert on that exact output — quirks, bugs, and all.
Do NOT correct anything you think looks wrong. The job is
to pin down what the code does today.
Cover: typical inputs, boundary inputs, null/undefined,
empty collections, malformed inputs, and any obvious overflow.

Run it. The agent reads the function, generates 15–30 tests, and runs them to discover the actual outputs. The output of the agent is now a test file that fails immediately if any future change shifts behaviour — including changes the agent itself might make later.
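The "run the function and record the actual output" step is the heart of the prompt, so here is a minimal sketch of what that observation loop amounts to. Everything in it is an assumption for illustration: `characterise`, `Observation`, and the stand-in `legacyYear` function (with an invented two-digit-year quirk) are hypothetical names, not part of any library or of the code discussed in this post.

```typescript
// Hypothetical helper: none of these names come from a real library.
type Observation<I> =
  | { input: I; outcome: "returned"; value: unknown }
  | { input: I; outcome: "threw"; errorName: string };

// Run `fn` over each input and record what actually happens,
// including thrown errors, without judging whether it is "right".
function characterise<I>(
  fn: (input: I) => unknown,
  inputs: I[]
): Observation<I>[] {
  return inputs.map((input): Observation<I> => {
    try {
      return { input, outcome: "returned", value: fn(input) };
    } catch (err) {
      const errorName = err instanceof Error ? err.name : typeof err;
      return { input, outcome: "threw", errorName };
    }
  });
}

// Stand-in legacy function, invented for illustration only.
function legacyYear(raw: string): number {
  const n = Number.parseInt(raw, 10);
  if (Number.isNaN(n)) throw new RangeError(`bad year: ${raw}`);
  return n < 100 ? 1900 + n : n; // quirk: "07" becomes 1907, not 2007
}

// Each observation becomes one pinned assertion in the test file.
const observed = characterise(legacyYear, ["2024", "07", "", "abc"]);
```

The point of recording thrown errors alongside return values is that a swallowed or surfaced exception is part of the contract too. Whether the agent builds something like this or simply runs the function by hand, the assertions should come from observed output, never from what the code "should" return.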

A worked example: a function that maps Singapore postcodes to delivery zones with a fallback for unknown ones. Old code returns "WEST" for unknown postcodes — almost certainly wrong, almost certainly relied upon downstream. The characterisation test the agent produces:

typescript

import { mapPostcodeToZone } from "./delivery";

describe("mapPostcodeToZone (characterisation)", () => {
  it("returns CENTRAL for 049315", () =>
    expect(mapPostcodeToZone("049315")).toBe("CENTRAL"));
  it("returns EAST for 460001", () =>
    expect(mapPostcodeToZone("460001")).toBe("EAST"));
  // Quirk: unknown postcodes silently return WEST.
  // This is almost certainly a bug — pin it down anyway.
  it("returns WEST for an unknown postcode", () =>
    expect(mapPostcodeToZone("999999")).toBe("WEST"));
  it("returns WEST for empty string (no validation)", () =>
    expect(mapPostcodeToZone("")).toBe("WEST"));
  it("throws for null input", () =>
    expect(() => mapPostcodeToZone(null as unknown as string)).toThrow());
});

The two WEST tests are the ones that matter. They capture a quirk the original author might not have meant. When a downstream change later wants to fix the silent WEST fallback, that's now a deliberate change with a failing test to acknowledge and update, not an invisible behavioural shift.

Lock the agent out until the safety net is in place

The harness layer earns its keep here. Add a permission deny rule to your settings.json for any module the agent isn't allowed to touch:

json

{
  "permissions": {
    "deny": [
      "Edit(src/legacy/billing/**)",
      "Edit(src/legacy/dates/**)"
    ]
  }
}

This blocks the agent from editing those paths until you explicitly unlock them, typically by generating characterisation tests first, confirming they pass, and then removing the deny rule for the specific module. The discipline becomes mechanical rather than aspirational. The deny rule says no on your behalf at exactly the moment the agent decides to be helpful.
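Unlocking one module at a time is easy to get wrong by hand, so a small script can help. The sketch below assumes the settings shape shown in the snippet above and nothing more; `AgentSettings` and `unlockModule` are hypothetical names, and reading or writing the actual settings.json file is left out.

```typescript
// Shape assumed from the settings.json snippet above; other fields are ignored.
interface AgentSettings {
  permissions: { deny: string[] };
}

// Return a copy of the settings with the Edit(...) deny rule for one
// path pattern removed. Throws if the rule is absent, so a typo in the
// pattern cannot silently "unlock" nothing.
function unlockModule(
  settings: AgentSettings,
  pathPattern: string
): AgentSettings {
  const rule = `Edit(${pathPattern})`;
  if (settings.permissions.deny.indexOf(rule) === -1) {
    throw new Error(`no deny rule found for ${rule}`);
  }
  return {
    ...settings,
    permissions: {
      ...settings.permissions,
      deny: settings.permissions.deny.filter((r) => r !== rule),
    },
  };
}

const locked: AgentSettings = {
  permissions: {
    deny: ["Edit(src/legacy/billing/**)", "Edit(src/legacy/dates/**)"],
  },
};

// Unlock only the dates module; billing stays denied.
const afterUnlock = unlockModule(locked, "src/legacy/dates/**");
```

Returning a new object rather than mutating the input keeps the locked configuration available for comparison, which is handy if you want the unlock to show up as a reviewable diff.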

Coverage honesty: line coverage lies

There's a subtle trap once the characterisation tests are in place: an agent (or human) can produce tests that exercise every line without actually asserting on the behaviour that matters. Line coverage will look great. The safety net will have holes.

Mutation testing is the honest measurement. The tool deliberately mutates the production code — flips a comparison, removes a return, inverts a boolean — and re-runs the test suite. If a mutant survives (no test fails), the suite missed a behavioural assertion. The surviving-mutant count is the real coverage gap.
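To make the surviving-mutant idea concrete, here is a hand-rolled sketch of a single mutant and two test suites. It only illustrates the concept and is not how Stryker or any other tool is implemented; `isFreeShipping` and the suite functions are invented for the example.

```typescript
// Production code under test (invented for illustration).
function isFreeShipping(totalCents: number): boolean {
  return totalCents >= 5000;
}

// A mutant of the kind a mutation tool might generate: `>=` flipped to `>`.
function isFreeShippingMutant(totalCents: number): boolean {
  return totalCents > 5000;
}

type Impl = (totalCents: number) => boolean;

// Weak suite: executes the only line (full line coverage)
// but never probes the boundary.
function weakSuitePasses(impl: Impl): boolean {
  return impl(9000) === true && impl(100) === false;
}

// Stronger suite: also pins the behaviour at exactly 5000.
function strongSuitePasses(impl: Impl): boolean {
  return weakSuitePasses(impl) && impl(5000) === true;
}

// A mutant "survives" when the suite still passes against it.
const survivesWeak = weakSuitePasses(isFreeShippingMutant);     // true
const survivesStrong = strongSuitePasses(isFreeShippingMutant); // false
```

Both suites report the same line coverage for `isFreeShipping`. Only the one with the boundary assertion notices when the comparison changes, which is exactly the kind of quirk a characterisation suite is supposed to pin.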

The toolchain depends on your stack:

JavaScript / TypeScript / C# / Scala: Stryker, current release v9.0.1 (May 2025). Tagline: "Test your tests with mutation testing." Note: contrary to what some references suggest, Stryker does not support Java.
Python: Cosmic Ray, active beta from sixty-north. Mature enough for production use on individual modules.
Industrial scale: Meta's Automated Compliance Hardening (ACH), published September 2025 by Mark Harman, uses LLMs to generate compliance-targeted mutants and matched tests. In an October–December 2024 trial across Facebook, Instagram, WhatsApp, and Meta wearables, "73% of the generated tests were accepted by engineers." Treat that number as a rough ceiling for what to expect when LLMs author the mutation suite, not as a guarantee.

For a brownfield refactor in agent hands, the playbook is:

1. Generate characterisation tests with the agent.
2. Run mutation testing on the new tests. Aim for a mutation score above 80% on the module being refactored.
3. Only after step 2 passes, remove the Edit(...) deny rule for that path.
4. Let the agent refactor. The tests now hold the contract.

Steps 1 and 2 cost a few minutes of agent time on a typical small module. Step 4 is the part that used to be terrifying. The first two steps make it boring.
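The gate between running mutation testing and removing the deny rule can be written down as a check rather than left to judgement. This is a simplified sketch: `MutationSummary`, `mutationScore`, and `mayUnlock` are hypothetical names, and the score here is just killed mutants over killed plus survived. Real tools also report categories such as timeouts and uncovered mutants, so read the numbers from your tool's report and adapt accordingly.

```typescript
// Simplified summary of a mutation run (hypothetical shape).
interface MutationSummary {
  killed: number;   // mutants that made at least one test fail
  survived: number; // mutants no test noticed
}

// Percentage of mutants killed. Returns 0 when no mutants were
// generated, so an empty run can never pass the gate.
function mutationScore(s: MutationSummary): number {
  const total = s.killed + s.survived;
  return total === 0 ? 0 : (s.killed * 100) / total;
}

// The deny rule may be removed only when the score is strictly
// above the threshold (80% in the playbook above).
function mayUnlock(s: MutationSummary, thresholdPercent = 80): boolean {
  return mutationScore(s) > thresholdPercent;
}

const datesModuleRun: MutationSummary = { killed: 42, survived: 8 };
const canUnlockDates = mayUnlock(datesModuleRun); // 84% > 80%
```

Treating an empty run as a failure is a deliberate choice: if the tool generated no mutants, you have learned nothing about the suite, and the module should stay locked.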

What this is really doing

The brownfield over-refactor anti-pattern is what happens when an agent is allowed to be opinionated about code that has no defended contract. It's the same boundary failure I've written about elsewhere as the gap between vibe coding and agentic engineering — prototype-mode looseness leaking onto code that other people depend on. The fix isn't to make the agent more cautious — well-tuned agents are still over-eager, that's their job. The fix is to defend the contract first, then let the agent be as opinionated as it likes within the boundary.

Feathers wrote Working Effectively With Legacy Code before AI agents existed. Twenty years later it remains the cheapest piece of legacy-code discipline you can adopt, and for the first time we have a tool that makes writing the characterisation tests fast enough that there is no excuse not to.

Don't let an agent touch code you can't roll back to. Generate the safety net first. Mutation-test it to make sure the net has no holes. Then refactor with both hands free.

