
I have been building with AI coding agents for over a year.1 Most of that time was unstructured: I used whatever worked, fixed problems as they came up, and did not pay much attention to the patterns. Then, starting around November 2025, I spent three months building a mostly personal internal autodiff library: graph-based, eagerly traced, NumPy-integrated, and almost entirely agent-written. By the end, it had reached roughly 80,000 lines of Python.

I wanted a graph-first, NumPy-native library where the graph itself was explicit, inspectable, and available for transformations beyond autodiff. JAX is excellent, but it optimizes for a different set of tradeoffs: staged execution, JIT compilation, and the accelerator stack.

Over those three months, my role shifted from reviewing every line of agent output to opening many sessions with a single question: “what should we work on next?” That shift did not happen only because the agents improved. It happened because the verification infrastructure improved enough that I could trust the floor: a minimum quality bar would hold regardless of what the agent produced.

This article is about that infrastructure — what it looks like, how it grew, and why it matters more than you might think. But the infrastructure did not build itself. Every check exists because a human saw a failure and decided to formalize the fix. The code is not open source, so the point of this article is not the artifact itself, but the verification patterns and harness design that emerged while building it.

Directed development

This library did not start with agents. I had been sketching it on the side for a while — a design document, a basic tracing engine, some scaffolding. By the time I opened the first AI session, the repo had about a thousand lines of Python and a clear picture of what the library should be. What it did not have was momentum: it was a side project I picked up between other work, never for long enough to get past the foundation.

The first session was a comparison: how did the library stack up against autograd and MyGrad?2 I was using the agent as a reviewer, asking for architectural opinions. When the agent got the target API wrong — proposing explicit tracing contexts when the library was designed for implicit tracing — I corrected it. The agent was a consultant. I was the authority.

The second session set up CI: ruff with all rules enabled, mypy strict, Hypothesis property-based testing, conventional commits, and AGENTS.md.3 None of this was about agent harnesses; I did not even know the term yet. It was engineering discipline and personal curiosity, which turned out to benefit agents enormously.

What followed was a burst of rapid development. Over a handful of intense sessions, the codebase grew from a small prototype into something much larger: graph optimization passes, workflow orchestration, xarray integration, debugging and visualization tools. But the derivative engine remained the most verification-intensive part, and the one that drove most of the harness. My workflow settled into a ritual:

“review the current state of the repo, what do you think would be highest leverage to work on next?”

Let the agent propose priorities. Select or redirect. Ask for a detailed implementation plan with user stories, test strategy, and acceptance criteria. Review it. Paste it back with: “PLEASE IMPLEMENT THIS PLAN.”4

Where agents drift

The workflow was productive but not self-correcting. Pushing back mattered — and I had to push back constantly.

On VJP coverage, the agent claimed near-complete parity with autograd. I was skeptical; autograd had far more VJPs than that.5 I was right — the comparison was incomplete. Compatibility shims revealed a similar pattern: the agent kept proposing backward-compatibility layers for a greenfield project with no users.

The shims deserve their own mention because they are so universal. Instead of making clean breaks, the agents kept old behavior alive through fallbacks and re-exports and built new features on top. Even with explicit instructions, the agents reached for shims.6 If you are building anything with coding agents, you will hit this.

Files bloated too. Core modules grew past 1,800 lines. Test files reached similar sizes. A vicious cycle: longer files fill agent context faster, the agent gets worse at navigating the codebase, the code gets worse, the files grow further. The 500-line file limit I eventually imposed came directly from that pain.

Pushing back

One moment captures why human judgment still matters in this workflow.

Autodiff frameworks need derivative rules for every operation, typically written as either VJPs (reverse-mode pullbacks) or JVPs (forward-mode pushforwards).7 This library had accumulated reverse-mode rules first, like most frameworks. In the middle of a session about expanding derivative coverage, the agent proposed deriving JVP rules from the existing VJP rules. That sounded reasonable.

Then I asked the question that changed the architecture: “Doesn’t it scale better to make JVPs the default source of truth and derive VJPs from them where possible, rather than vice versa?”

For this codebase, yes. JVPs were the cleaner authoring primitive for much of the covered NumPy surface, and reverse-mode pullbacks could often be synthesized through the transpose machinery. Some operations still needed explicit reverse-mode exceptions for correctness or performance, but the default direction was backwards from what the agent proposed. Roughly twenty commits in a single day shifted the covered NumPy surface to a JVP-first policy.
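The relationship between the two rule directions can be illustrated with a toy sketch: a JVP computes J @ t, so probing it with basis vectors recovers the Jacobian column by column, and the VJP is then J.T @ v. Real frameworks transpose the linear computation symbolically; this brute-force version (with a hypothetical elementwise square rule) only conveys the relationship, not how the library does it.

```python
"""Toy illustration of deriving a VJP from a JVP by probing the Jacobian.

A JVP maps a tangent t to J @ t; the VJP maps a cotangent v to J.T @ v.
This brute-force basis-probing version is illustrative only.
"""


def jvp_square(x, t):
    # Forward-mode rule for f(x) = x**2 elementwise: J @ t = 2*x*t.
    return [2 * xi * ti for xi, ti in zip(x, t)]


def vjp_from_jvp(jvp, x, v):
    """Compute J.T @ v by evaluating the JVP on each basis vector."""
    n = len(x)
    out = []
    for j in range(n):
        basis = [1.0 if i == j else 0.0 for i in range(n)]
        column = jvp(x, basis)  # j-th column of the Jacobian
        out.append(sum(vi * ci for vi, ci in zip(v, column)))
    return out
```

For elementwise operations the Jacobian is diagonal, so the VJP and JVP formulas coincide; the asymmetry only shows up for operations that mix components, which is where symbolic transposition earns its keep.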

Later that evening, I checked the agent’s work again: if JVPs were now the default source of truth, why did the rule layout still look overwhelmingly reverse-mode? The file structure told a different story than the runtime claims. The agent had wired the runtime correctly, but too much of the old reverse-mode formula structure was still in place. It took another full session — running overnight, largely autonomously — to make the migration real in the authored rule layout, not just the plumbing.

A well-defined mathematical problem, clear correctness criteria, and a human catching both the architectural direction and the incomplete execution. If I hadn’t asked, I probably wouldn’t have caught it until the derivative layer was deeply entrenched — and unwinding it would have been painful.

The proto-harness

The JVP migration went (reasonably) well because the problem was mathematically well-defined. Most problems aren’t. Every piece of verification infrastructure that followed was a response to something that went wrong.

The first tool was pre_pr.sh — a small shell script born from watching the agent skip steps with each PR. It was the “unenforced verification” failure mode from the first article, playing out in real time. The agent would forget to run mypy, or skip the VJP coverage check, or not rebuild docs. The script was my first attempt at “one command to verify everything”: check for a clean working tree, rebase on main, run the linter, type checker, tests, coverage checks, grad contracts, and docs build, in sequence. If any step failed, the whole thing failed.

The shell script could not keep up, so I replaced it with quality.py — a Python CLI consolidating the scripts into a single framework.
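The shape of such a consolidated runner is simple: an ordered list of steps, executed fail-fast. A minimal sketch in that spirit (the step names and commands here are illustrative placeholders, not the real quality.py):

```python
"""Illustrative fail-fast check runner in the spirit of quality.py.

Step names and commands are placeholders; the real script ran ruff,
mypy, pytest, coverage checks, grad contracts, and a docs build.
"""
import subprocess
import sys


def run_steps(steps):
    """Run (name, argv) steps in order; stop at the first failure."""
    for name, argv in steps:
        print(f"[{name}] running")
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"[{name}] FAILED")
            return False
    print("all checks passed")
    return True


# Hypothetical steps standing in for the real lint/type/test commands.
STEPS = [
    ("lint", [sys.executable, "-c", "print('lint ok')"]),
    ("types", [sys.executable, "-c", "print('types ok')"]),
    ("tests", [sys.executable, "-c", "print('tests ok')"]),
]

ok = run_steps(STEPS)
```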

But the commands to run verification differed across AGENTS.md, CI workflow files, and docs.8 And nothing was scoped: every check ran against the full codebase regardless of what changed.

Each tool was a response to pain.

Boundaries

The architecture split into layers: a backend-agnostic core, a numpy integration layer, and a grad package for autodiff. The intent was clean separation — core should have zero knowledge of NumPy, so adding a CuPy or JAX backend later would be more tractable. It also gave agents clear lanes to work in.

On the day this layered architecture was declared, the boundary checker caught its first violation within hours. Then another. The architecture was established, violated, fixed, violated again, and fixed again. The original checker only caught cross-layer internal-package imports; it missed bare import numpy entirely, so I had to extend it, again and again.

The boundary checker ran continuously, but it only caught what it knew to look for. Ten weeks later, I manually scanned what the agent had built in the gradient package and found dozens of files with import numpy as np, well over a thousand np.* callsites, and a package that would not run without numpy installed even though its pyproject.toml declared no numpy dependency. After a big-bang refactor, the agent introduced a backend-agnostic proxy module — but exported the proxy object as np, not xp. Those files then did from ..._backend_runtime import np. Syntactically different from import numpy as np. Visually identical.

Each round taught me to probe deeper. Was there any code in core that imported numpy, even lazily? Any mention of numpy in core, even as a variable name or string literal? There were — hardcoded "numpy" string literals, prefix-coupled registration, multiple compatibility wrapper modules. Every question I learned to ask was a check I should have automated earlier.
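The kind of probe those questions suggest can be sketched with the ast module. This is not the real boundary checker (which is not shown in the article); it is a minimal illustration of catching direct imports, aliased proxy re-exports, and hardcoded string literals in one pass:

```python
"""Sketch of an import-boundary probe for a forbidden backend module.

Illustrative only: catches direct imports (even lazy ones inside
functions), `from ... import np` proxy re-exports, and hardcoded
"numpy" string literals.
"""
import ast

FORBIDDEN = "numpy"


def boundary_violations(source: str) -> list[str]:
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] == FORBIDDEN:
                    findings.append(f"import {alias.name} (line {node.lineno})")
        elif isinstance(node, ast.ImportFrom):
            mod = (node.module or "").split(".")[0]
            if mod == FORBIDDEN:
                findings.append(f"from {node.module} import ... (line {node.lineno})")
            for alias in node.names:
                # `from some_backend import np` is syntactically different
                # from `import numpy as np` but visually identical in use.
                if alias.name == "np" and mod != FORBIDDEN:
                    findings.append(f"proxy np from {node.module} (line {node.lineno})")
        elif isinstance(node, ast.Constant) and node.value == FORBIDDEN:
            findings.append(f'string literal "{FORBIDDEN}" (line {node.lineno})')
    return findings
```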

Curating context

Boundary enforcement was about rules within sessions. The next problem was continuity between them.

I tried to scale development with sub-agents. In Claude Code, I built seven specialized agents: dispatch, test-gen, quality, architect, numpy-protocol, docs, debug. An agent team, each specialized on a domain. It did not stick. I burned a lot of tokens, but the output quality was arguably worse than if I had stuck with a single agent and manual steering. I think the problem was context: each agent starts from only a prompt and has to derive all its context from there. Handoff documents either included too little — and the receiving agent made wrong assumptions — or too much — and the agent could not distinguish signal from background.

What did work was something simpler. Near the end of a long session, I was about to ask the agent to continue with the next phase. Instead, I asked it to write the prompt I should use for that next phase.

The agent generated a detailed handoff prompt — project state, remaining work, constraints, validation commands — that I would then curate and paste into a fresh session. It was a “relay” pattern, where the prompt is a compressed representation of what matters: what was just done, what remains, what constraints apply.9

Within days, I was running multiple agents in parallel. Each session got its own git worktree.10 I broke tasks down into work streams and dispatched them: “Give me a prompt for each of these work streams, and tell me which ones I can kick off in parallel.” Without quality.py running in each worktree, parallel agents would have been utter chaos. But now, each agent could independently verify its own changes, enabling a workflow that would not have been sustainable otherwise.

Properly, this time

Still, the quality gate turned red.11 The jvp_grad_runtime_ratio had drifted above the 10.0x threshold because of measurement noise on higher-order workloads. It was effectively blocking merges unrelated to performance.

Around the same time, “harness engineering” was becoming a more explicit frame for what many teams were converging on. OpenAI had published their article on the topic, and others were describing similar patterns.12 The general idea is simple: agent reliability comes from the environment, not just the model, and verification infrastructure deserves primary investment. Seeing the pattern named made the next step clear.

I took a day off — no sessions, no commits. Then I decided to do it properly.

The next session started with a single instruction: review the repo in light of OpenAI’s harness engineering article and suggest how it should be restructured. That kicked off a hard cutover. pre_pr.sh and quality.py gave way to a dedicated harness, diff-scoped mutation testing, and JSON output with bounded content for agent-friendly context windows. That was the harness turning point.

Loop, mutate, gate

The harness is a CLI that wraps the repo’s verification into three progressively broader commands:

uv run python scripts/harness.py loop          # fast scoped, 180s budget
uv run python scripts/harness.py mutate        # diff-scoped mutation
uv run python scripts/harness.py gate          # full blocking merge gate

The agent no longer needs to know what verification steps exist or where they live. It runs a command and gets a go or no-go result.

Harness rhythm

The rhythm: fast, scoped verification after each edit, then mutation pressure, then the full blocking gate before merge. Any failed verification step sends the session back to editing; the loop is only useful if the failure path is explicit and unavoidable. And the inner loop has to stay cheap, scoped to the actual blast radius of a change rather than the whole repo, because if verification is slow the agent starts deferring it instead of relying on it.

loop is the tight inner cycle. It figures out what changed, expands through a dependency graph to find affected packages, and runs only the relevant checks — lint, type-checking, tests, quality gates — under a 180-second budget. If it runs out of time, it defers remaining checks rather than failing. In practice, this changed the agent’s behavior: instead of running the full test suite after every edit — or worse, running nothing until the end — the agent started running loop after each logical change, catching issues while the context was still fresh.
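The defer-instead-of-fail behavior under a time budget can be sketched in a few lines. This is a hypothetical simplification of the real loop command, with checks modeled as plain callables:

```python
"""Sketch of a time-budgeted inner loop: run checks until the budget is
spent, then defer the rest instead of failing. Illustrative only.
"""
import time


def run_loop(checks, budget_s=180.0):
    """checks: list of (name, callable returning bool). Returns a report."""
    start = time.monotonic()
    report = {"passed": [], "failed": [], "deferred": []}
    for name, check in checks:
        if time.monotonic() - start >= budget_s:
            # Out of budget: defer remaining checks rather than fail them.
            report["deferred"].append(name)
            continue
        (report["passed"] if check() else report["failed"]).append(name)
    # Deferred checks do not block; only real failures do.
    report["ok"] = not report["failed"]
    return report
```

The key design choice is that a deferral is not a failure: the full gate will still run everything before merge, so the loop can afford to be optimistic in exchange for staying fast.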

mutate answers a different question: do the tests actually verify the changed behavior, or do they just execute it?

gate is the full merge requirement. All checks, no scoping, no budget. Everything must pass.

At the time of writing, the harness runs dozens of checks across those three commands. Some of the most useful exist because of specific agent behaviors. The suppression guard blocks new # type: ignore and # noqa annotations — without it, the agent’s first instinct when a type check fails is to suppress the error rather than fix it. The import boundary checker enforces architectural layering with AST-level analysis.
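A suppression guard of that kind reduces to scanning the added lines of a unified diff. A minimal sketch, not the real implementation:

```python
"""Sketch of a suppression guard: flag newly added `# type: ignore` and
`# noqa` comments by scanning the added lines of a unified diff.
Illustrative only.
"""
import re

SUPPRESSIONS = re.compile(r"#\s*(type:\s*ignore|noqa)")


def new_suppressions(diff_text: str) -> list[str]:
    hits = []
    for line in diff_text.splitlines():
        # Only lines added by this change ('+' but not the '+++' file header).
        if line.startswith("+") and not line.startswith("+++"):
            if SUPPRESSIONS.search(line):
                hits.append(line[1:].strip())
    return hits
```

Removed lines are ignored on purpose: deleting an old suppression should never be penalized, only adding a new one.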

Every command returns a JSON envelope with bounded content — at most 20 checks, 8 details per check, 240 characters per detail — so the agent’s context window is not flooded with log output. A typical response looks like this (simplified):

{
  "ok": false,
  "command": "harness loop",
  "result": {
    "scope": { "expanded_scopes": ["core", "grad", "numpy", "..."] },
    "checks": [
      {
        "id": "pytest_scoped",
        "status": "fail",
        "details": ["FAILED test_tracer.py::test_record - AssertionError"]
      }
    ],
    "summary": { "total": 5, "passed": 4, "failed": 1 }
  },
  "next_actions": [{ "command": "harness loop", "description": "Re-run after fix." }]
}

The next_actions field is context-aware: after a successful loop, the harness suggests mutate; after a failure, it suggests the specific post-fix command.
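The bounding itself is mechanical: cap the number of checks, the details per check, and the characters per detail. A sketch using the limits quoted above (the function name and structure are illustrative):

```python
"""Sketch of envelope bounding: cap checks, details per check, and
characters per detail so log output never floods the agent's context.
The limits are the ones quoted in the article; the function is a
hypothetical illustration.
"""

MAX_CHECKS, MAX_DETAILS, MAX_CHARS = 20, 8, 240


def bound_envelope(checks: list[dict]) -> list[dict]:
    bounded = []
    for check in checks[:MAX_CHECKS]:
        details = [d[:MAX_CHARS] for d in check.get("details", [])[:MAX_DETAILS]]
        bounded.append({**check, "details": details})
    return bounded
```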

None of these techniques are new individually. Scope-aware test selection exists in Bazel, Nx, and plenty of CI tools. Mutation testing has been around for decades.13 The pieces needed to be wired together in a specific way: JSON output bounded for agent context windows, a progression from fast-and-scoped to slow-and-complete, and every check degrading toward strictness when anything is uncertain.14

What to check

The core mechanic that makes the harness practical is scope resolution — turning “what files changed” into “what checks to run.” It starts with git diff origin/main to get changed paths, matches each against a path map that routes file paths to one of nine package scopes, then expands through a dependency graph using BFS. In simplified form, the config looked like this:

[scope.path_map]
"packages/core/"            = "core"
"packages/numpy/"           = "numpy"
"packages/grad/"            = "grad"
"packages/grad_numpy/"      = "grad_numpy"
"packages/xarray/"          = "xarray"

[scope.dependencies]
core       = []
grad       = ["core"]
numpy      = ["core"]
grad_numpy = ["core", "grad", "numpy"]
xarray     = ["core", "numpy"]

Change a file in packages/core and BFS expands to all dependent packages. Change only packages/xarray and only xarray checks run. Anything unrecognized — unmapped paths, missing merge-base, diff failures — falls back to the full gate.

This is what enables the 180-second loop budget. Without scoped checks, every change triggers every check — too slow for a tight agent loop. With scoping, a change to one leaf package triggers seconds of verification, not minutes.
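The path map and BFS expansion described above can be sketched directly. Note the direction: the config lists each package's dependencies, but expansion needs the reverse edges, so a change in core pulls in everything that depends on it. This is an illustrative reconstruction mirroring the simplified config, not the harness's actual code:

```python
"""Sketch of scope resolution: map changed paths to package scopes, then
BFS through *reverse* dependencies to find everything affected.
Mirrors the simplified config from the article; illustrative only.
"""
from collections import deque

PATH_MAP = {
    "packages/core/": "core",
    "packages/numpy/": "numpy",
    "packages/grad/": "grad",
    "packages/grad_numpy/": "grad_numpy",
    "packages/xarray/": "xarray",
}
DEPENDENCIES = {
    "core": [], "grad": ["core"], "numpy": ["core"],
    "grad_numpy": ["core", "grad", "numpy"], "xarray": ["core", "numpy"],
}


def resolve_scopes(changed_paths):
    # Invert the dependency map: dependents of each scope.
    dependents = {s: set() for s in DEPENDENCIES}
    for scope, deps in DEPENDENCIES.items():
        for dep in deps:
            dependents[dep].add(scope)
    seeds = set()
    for path in changed_paths:
        for prefix, scope in PATH_MAP.items():
            if path.startswith(prefix):
                seeds.add(scope)
                break
        else:
            return None  # unmapped path: fall back to the full gate
    # BFS from the seed scopes through their dependents.
    seen, queue = set(seeds), deque(seeds)
    while queue:
        for dependent in dependents[queue.popleft()]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen
```

Returning None on anything unrecognized is the fail-closed posture: ambiguity resolves toward running everything, never toward skipping checks.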

No survivors

Coverage measures execution. Mutation testing measures verification.

An agent can (and will) write a test like this:

def test_gradient_scaling():
    result = scale_gradient(x, factor=0.5)
    assert result is not None  # 100% coverage, 0% verification

That test covers every line of scale_gradient. But flip factor > 0 to factor >= 0 inside the function, and the test still passes. It is not testing the behavior it claims to cover.
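For contrast, here is what a test that kills that mutant looks like. Both scale_gradient's body and the test names are hypothetical stand-ins built around the article's example, not the library's real API:

```python
"""A hypothetical `scale_gradient` and the two kinds of test against it.
Names and behavior are illustrative stand-ins, not the library's API.
"""


def scale_gradient(xs, factor):
    # Only positive factors scale; non-positive factors leave xs untouched.
    if factor > 0:
        return [x * factor for x in xs]
    return list(xs)


def weak_test():
    # Executes every line, verifies nothing: survives a `>` -> `>=` flip.
    assert scale_gradient([2.0], factor=0.5) is not None


def strong_test():
    # Pins down exact values *and* the factor == 0 boundary. The
    # `>` -> `>=` mutant would zero the list at factor == 0, so the
    # second assertion kills it.
    assert scale_gradient([2.0, 4.0], factor=0.5) == [1.0, 2.0]
    assert scale_gradient([2.0], factor=0) == [2.0]
```

The difference is the boundary assertion: under the mutation, factor=0 takes the scaling branch and returns [0.0], which strong_test notices and weak_test never would.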

Diff-scoped mutation testing

The check narrows from a whole PR down to one hard question: would the tests notice a real behavioral change? A surviving mutant means the tests still pass after a real behavioral change was injected. That is not a coverage problem; it is a trust problem, and it is exactly where weak AI-written tests get exposed. Every surviving mutant on changed, executed lines is treated as a blocking signal rather than a coverage statistic.

The idea came from a Slack conversation with Frederik: mutation testing is expensive on the whole codebase, but what if you scope it to the diff? Mutating only the five to twenty lines you just changed makes the cost manageable. The pipeline: find changed source lines via git diff, filter to lines actually executed by tests via coverage, generate AST-level mutations (comparison flips, boolean inversions, arithmetic swaps, constant flips), select at most twelve mutants with breadth across files, and run the relevant tests against each. If the tests still pass after a mutation, the mutant survived — the tests do not verify the behavior they claim to cover.
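One of those mutation operators, the comparison flip, can be sketched with ast.NodeTransformer. This is a toy version assuming the harness's internals (which are not shown): it mutates a function's source, re-executes it, and reports whether a given test notices:

```python
"""Sketch of one AST-level mutation operator: flip `>` <-> `>=` and
`<` <-> `<=` in a function's source, then check whether a test notices.
Illustrative; a real harness generates several operator families.
"""
import ast


class ComparisonFlipper(ast.NodeTransformer):
    FLIPS = {ast.Gt: ast.GtE, ast.GtE: ast.Gt, ast.Lt: ast.LtE, ast.LtE: ast.Lt}

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [self.FLIPS.get(type(op), type(op))() for op in node.ops]
        return node


def mutate_and_test(source: str, func_name: str, test) -> bool:
    """Return True if `test` *kills* the mutant (raises on the mutated code)."""
    tree = ComparisonFlipper().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    namespace: dict = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    try:
        test(namespace[func_name])
    except AssertionError:
        return True   # the test noticed the behavioral change: mutant killed
    return False      # the test passed anyway: mutant survived
```

A boundary-pinning test kills the flip while a merely-executing test does not, which is exactly the distinction the kill-rate policy enforces.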

The policy is strict: 100% kill rate on changed lines. If any mutant survives, the PR is blocked. And require_changed_tests = true adds another constraint: if you change runtime code, you must also change or add tests. No silent runtime changes. The mutation check then verifies that those tests actually catch real behavioral differences, not just execute code paths.
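The require_changed_tests constraint reduces to classifying the changed paths. A sketch under assumed path conventions (the tests/ layout here is illustrative):

```python
"""Sketch of the `require_changed_tests` constraint: if any runtime
(non-test) source file changed, at least one test file must have changed
too. Path conventions are illustrative assumptions.
"""


def is_test_path(path: str) -> bool:
    return "/tests/" in path or path.startswith("tests/")


def changed_tests_ok(changed_paths: list[str]) -> bool:
    runtime = [p for p in changed_paths if p.endswith(".py") and not is_test_path(p)]
    tests = [p for p in changed_paths if is_test_path(p)]
    # Pure doc/config changes pass; runtime changes demand test changes.
    return not runtime or bool(tests)
```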

Pure refactors and equivalent mutants occasionally cause friction. The tradeoff is worth it when the entity writing your tests is an AI that optimizes for making them pass rather than making them meaningful. Diff-scoping is what makes it practical. Full-codebase mutation testing is a research project. Diff-scoped mutation testing is a CI check.

The new normal

What changed after the harness is not that the agent writes better code. What changed is that bad code gets caught — missing tests, broken boundaries, skipped steps, silent regressions — before it compounds.

Across 287 commits from November 27, 2025 to February 24, 2026, with February 16 as the harness cutover, the share of commits touching test files rose from about 41% to 76%, consistent with require_changed_tests.15 The fix ratio held roughly flat at about 11%. Commits also got larger — averaging about 1,187 insertions versus 668 pre-harness — suggesting more confidence in landing bigger changes when the harness catches mistakes.

But the more interesting change was behavioral. I had always asked the agent to propose priorities — that ritual started early. But before the harness, I treated the proposals as suggestions and drove each session with specific goals: “implement this JVP,” “fix the module layout,” “split these files.” After, I could actually follow the agent’s lead. Most sessions started the same way — “what should we work on next?” — and this time I meant it. The default shifted from “I tell you what to build” to “you tell me what needs building, and I decide whether to approve.”

That shift felt strange the first few times. Asking an AI “what should we work on?” and actually trusting the answer requires believing the floor will catch whatever goes wrong. The harness was that floor. It could not prevent the agent from writing mediocre abstractions or introducing unnecessary complexity — architectural judgment still requires a human. But it could enforce that tests existed, that they verified behavior, that imports respected boundaries, that type annotations were real16, that suppressions were not growing. The minimum quality was no longer me. It was the tooling.

The relay pattern from the earlier sprints would not have scaled without it. Parallel agents exacerbate every problem a single agent has — more drift, more skipped steps, more context confusion — and the harness was what kept them honest.

When something went wrong, the response changed too. Finding a bug in np.pad tracing, I did not just ask for a fix. I also asked why it slipped through, what would have prevented it, and how the instructions or harness should change so the same class of bug would not recur. Every bug became a harness improvement opportunity. After feature sessions, I started asking: was the harness useful? what did it catch? did mutation tests surface anything? They almost always had — mutation testing reliably catches weak tests that the agent wrote to pass, not to verify. The system learned from its failures, not through any kind of machine learning, but through a human treating each failure as evidence that a check was missing.17

What transfers

This implementation is purely Python, and that shows in the choice of technologies. AST-based mutation testing, pytest-driven coverage, ruff and mypy as lint and type gates — these are ecosystem-specific. The implementation is shaped by its context. But the exercise transfers: figure out what your agent keeps getting wrong, build a check for it, and wire that check into a loop the agent cannot skip. The specific checks will differ. The discipline should not.

Start with the check that would have caught the last thing that went wrong. The harness will grow from there.

I built this library intentionally as a learning experience for myself — paying attention to the patterns, documenting what went wrong, formalizing each fix. I suspect much of it will be familiar to anyone who has spent real time building with agents. The stacks and checks will vary. The habit should not: when the agent fails, turn the failure into a constraint, a check, or a better handoff, then keep going.

Footnotes

  1. This article was written with AI assistance. I dictated raw thoughts using dictation software, then worked with Claude to turn them into prose — iterating paragraph by paragraph and pushing back whenever something did not sound like me. The research was AI-assisted too: I used a local transcript indexing and search tool, and had agents crawl through three months of git history to surface the timelines and stats behind the claims here. The result is more thoroughly researched than what I would have produced on my own, which is part of the point.

  2. Autograd by Maclaurin, Duvenaud, and Johnson is the original NumPy autodiff library — elegant, influential, and the reason most of us know reverse-mode AD can feel native to Python. I had used it extensively, and it was the main reference point in my head. MyGrad by Ryan Soklaski takes a different approach — a Tensor object with NumPy ufunc/function overrides rather than autograd’s tracing tape — and its Hypothesis-heavy testing style directly influenced this library’s.

  3. Autodiff is the perfect application for property-based testing — you can express real mathematical invariants (gradient correctness, chain rule composition, forward/reverse agreement) and let the framework try to break them.

  4. The plans were remarkably specific. A typical one would include exact file paths, function signatures, test case names, and threshold constants — for example, AUTO_MIN_NODES_FOR_ANY_OPT = 128, AUTO_MIN_NODES_FOR_CSE = 512. The agent produced these from its analysis of the codebase; I reviewed and approved.

  5. Autograd was the reference point I had in mind when I was sanity-checking the agent’s parity claim.

  6. My best guess at the cause: post-training fine-tuning rewards “safe” code — strong compatibility guarantees, no breaking changes, defensive patterns. The training data is overwhelmingly production code with real users, where preserving backward compatibility is the right default. The model has no way to distinguish that context from a greenfield repo with zero consumers. Sensible instinct, wrong situation.

  7. For readers who want to dive deeper: JAX’s Autodiff Cookbook is an excellent introduction to forward- and reverse-mode differentiation, and their Custom derivative rules page explains exactly the JVP-first approach adopted here — define a JVP rule, and the framework derives VJPs automatically by transposing the linear computation.

  8. The naive question is: why not just keep them in sync? In practice, a growing codebase accumulates multiple places that encode the same information — agent instructions, CI configs, contributor docs, READMEs — and no matter how explicit the instructions are, they drift. Each source gets updated in its own context, by a different session or a different agent, and nobody notices the divergence until something breaks. This is hard to stay on top of even with human contributors; with agents that read whatever file they find first, it compounds fast. The harness solved this by making the single CLI the only source of truth — AGENTS.md says “run harness loop,” CI runs “harness gate,” and neither needs to enumerate individual steps.

  9. Both Claude Code and Codex CLI had automatic context compaction by this point — summarizing conversation history when the context window fills up. The relay pattern solves a different problem. Auto-compaction tries to preserve everything; the relay deliberately discards, starting a fresh session with only what the human judges relevant. The compression is lossy by design — that is the point.

  10. Around this time, I shifted most implementation work from Claude Code to Codex. Different tools for different strengths — Codex sessions averaged five hours for heavy, instruction-driven implementation; Claude Code sessions averaged under two hours for analysis, architecture review, and focused tasks.

  11. This library’s graph-based tracing is inherently heavier than autograd’s flat tape — you pay for node allocation, edge management, and scope tracking on every operation. Early benchmarks showed very large overhead. A dedicated performance sprint across several parallel workstreams brought this down significantly, and a 10x ratio was set as the acceptable threshold: slow enough to reflect the architectural cost, fast enough to be usable. The threshold was wired into quality.py as a blocking gate — which worked until measurement noise on higher-order workloads pushed it past 10x on runs where nothing performance-related had changed.

  12. The term gained traction in early 2026. Mitchell Hashimoto’s “My AI Adoption Journey” described the practice of engineering a solution for every agent mistake so it never recurs — each line in his AGENTS.md traced to a specific past failure. OpenAI’s “Harness engineering” made the case at scale using Codex. Anthropic demonstrated it by having sixteen parallel Claude agents build a C compiler in Rust. Cursor showed what happens at the extreme — agents building a browser from scratch, running unattended for a week. Martin Fowler’s analysis provided conceptual framing. Can Duruk’s “The Harness Problem” argued the harness is the bottleneck, not the model.

  13. Google runs diff-based mutation testing on every code change to its monorepo, using the same core idea: generate mutants only in changed lines, use coverage data to select relevant tests, suppress unproductive mutations. Their system serves tens of thousands of developers. See Petrovic and Ivankovic, “State of Mutation Testing at Google” (ICSE 2018).

  14. This is the “fail closed” principle from the first article: when the system cannot determine whether something is safe, it should assume it is not. Every ambiguity resolves toward more checking, not less.

  15. “Touched test files” means the commit’s diffstat includes at least one file under a tests/ directory. “Fix” commits are classified by conventional commit prefix (fix:). Insertions are raw git log --stat totals.

  16. In a typed codebase, agents do not have to guess what a function expects or returns. Types are how agents navigate a large codebase without reading every implementation. In Python that discipline is optional rather than enforced by the language, so tools like mypy strict have to carry the load.

  17. This extended to the agent instructions themselves. Early on, I established a rule that AGENTS.md should self-update: if an agent followed a rule and it still led to the wrong outcome, the rule needed to be refined. The instructions file became a living document maintained by the agents who consumed it.