
Consider two recent experiments with coding agents. Similar ambition. Opposite outcomes.

In the first, a team pointed hundreds of agents at a browser project. In a week they produced roughly a million lines of code. By coordination metrics it was a success: parallel work, lots of merged PRs, visible throughput. But when the project went public, outside observers pointed to failing CI and questioned how much of that visible throughput translated into a clean, working system.

In the second, as described in Claude is not a senior engineer (yet), a single engineer connected Claude to an automated browser testing suite (Playwright) and an error monitoring tool (Sentry). The agent wrote code, ran the tests, read the error traces, and fixed its own bugs. Ninety minutes later, it worked.

I am not trying to offer a definitive postmortem on either case. I am using them as contrasting examples of a broader engineering pattern.

The pattern is a standard engineering concept: tolerance.

Tolerance: How much drift can you afford?

Mechanical engineering abandoned binary “works/doesn’t work” thinking decades ago. A bridge doesn’t just “work”: it tolerates a specific load variance under specific conditions. The question is the allowable margin of error, i.e., the acceptable band around the ideal.

Software engineering has tolerances, too.

Tolerance isn’t one number. In software it decomposes into dimensions like correctness, security, latency, cost, reversibility, and blast radius (think error budgets). UI copy may be flexible on exact wording but not on brand tone. A refactor may tolerate new implementation details but not behavior changes.

Some tasks are high tolerance: exploratory prototyping, quick internal tools, one-off scripts, early drafts. Drift is acceptable because the goal is discovery and speed.

Other tasks are low tolerance: production infrastructure, security boundaries, customer-facing behavior, billing and permissions. Here “close enough” isn’t a solution. It’s a failure that may not show up immediately, but will surface later as incidents and churn.
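The claim that tolerance isn’t one number can be made concrete by representing it as a small per-task record. A minimal Python sketch; the dimension names, scales, and the 0.1 threshold are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class Tolerance:
    """Acceptable drift per dimension, from 0.0 (none) to 1.0 (lots).

    The dimensions and scale here are illustrative, not a standard.
    """
    correctness: float    # fraction of observable behavior allowed to change
    latency: float        # acceptable latency regression (0.1 = +10%)
    reversibility: float  # 1.0 = trivially revertible, 0.0 = irreversible
    blast_radius: float   # fraction of users a mistake may touch

# A throwaway prototype tolerates almost anything.
prototype = Tolerance(correctness=0.5, latency=1.0, reversibility=1.0, blast_radius=1.0)

# A billing change tolerates almost nothing.
billing = Tolerance(correctness=0.0, latency=0.1, reversibility=0.2, blast_radius=0.0)

def is_low_tolerance(t: Tolerance) -> bool:
    # One near-zero dimension is enough to make the whole task low tolerance.
    return min(t.correctness, t.latency, t.reversibility, t.blast_radius) < 0.1
```

The `min` is deliberate: a task is only as tolerant as its least tolerant dimension.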

The difference between those two agent experiments was that one treated a low-tolerance problem with a high-tolerance process.

That mismatch shows up as a control problem.

Open Loops vs. Closed Loops

In an open loop, the agent writes code and opens pull requests, but verification bottlenecks at human review, arriving minutes, hours, or days later. The delay between action and verification lets errors accumulate. Drift becomes visible only after it’s expensive.

In a closed loop, the agent makes a change and immediately runs verification against hard constraints. The loop itself damps error. Closed loops require fast, reliable feedback; slow or flaky verification re-opens the loop.
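The closed loop can be sketched as a cycle that runs verification after every change and terminates only on satisfied constraints or an exhausted budget. Here `propose_change` and `verify` are hypothetical stand-ins for the agent and the hard constraints (test suite, error traces):

```python
from typing import Callable, Optional, Tuple

def closed_loop(
    propose_change: Callable[[str], str],       # the agent: feedback -> new change
    verify: Callable[[str], Tuple[bool, str]],  # hard constraints: change -> (ok, feedback)
    max_turns: int = 10,
) -> Optional[str]:
    """Run change -> verify -> feedback until constraints pass or budget runs out."""
    feedback = "initial task description"
    for _ in range(max_turns):
        change = propose_change(feedback)
        ok, feedback = verify(change)
        if ok:
            return change  # constraints satisfied: the only success exit
    return None  # budget exhausted: escalate to a human instead of merging
```

Note the termination condition: “verification passed,” not “output produced.” A slow or flaky `verify` silently turns this back into an open loop.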

The Two Experiments Compared

| | Browser Project (Open Loop) | LLM + Tests + Traces (Closed Loop) |
| --- | --- | --- |
| Feedback signal | Informational (throughput, mergeability, activity) | Structural (tests pass, errors resolved) |
| Verification timing | After the fact, by humans | Every turn, by the agent |
| Termination condition | PR merged | Constraints satisfied |
| Outcome | Implied success; required human fixes | Working fix in 90 minutes |

Read this way, the browser project optimized for coordination metrics, while the Claude setup optimized for correctness under feedback. The point is not that many agents are inherently bad. The point is that open loops amplify drift when verification is weak. Agent reliability isn’t magic model behavior; it’s an environment where correctness is continuously verified.

But verification has a precondition: success must be expressible as constraints the agent can actually verify.

For a picture of what the browser experiment could have looked like with human coordination and a tight loop, see One Human + One Agent = One Browser From Scratch.

The Ambiguity Gap

The fundamental challenge with agents is that intent is latent (in your head), while evidence is explicit (text documents in the repo).

Ambiguity is the distance between intent and evidence.

When you delegate to a human engineer, they bridge that gap with judgment: they ask clarifying questions, infer missing context, notice anomalies, and sanity-check against domain knowledge.

When you delegate to an AI agent, it can’t feel that gap. It needs measurable constraints to know whether it has actually crossed from “plausible output” to “correct result.”

Without constraints, the agent will still do as instructed. It will just optimize for the easiest proxy it can satisfy: producing output, closing tickets, merging PRs. Not correctness or integration.

A practical predictor of these outcomes is:

Can you describe success in terms of constraints the agent can verify?

If you can, agents compound your effort. If you can’t, you’re doing exploration, and you should treat the output as exploration.
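What “success in terms of constraints the agent can verify” can look like: a definition of done as named, machine-checkable predicates. The three checks below are stubs standing in for a real test runner, linter, and latency probe; the names and the 250 ms budget are invented for illustration:

```python
def run_tests() -> int:
    """Stand-in for your real test runner; returns its exit code."""
    return 0

def lint_error_count() -> int:
    """Stand-in for your linter's error count."""
    return 0

def p95_latency_ms() -> float:
    """Stand-in for a latency probe against a staging deploy."""
    return 180.0

# Definition of done: every entry must verify, or the task isn't finished.
CONSTRAINTS = {
    "tests_pass": lambda: run_tests() == 0,
    "no_lint_errors": lambda: lint_error_count() == 0,
    "p95_under_250ms": lambda: p95_latency_ms() <= 250,
}

def done(constraints=CONSTRAINTS) -> bool:
    return all(check() for check in constraints.values())
```

If you can’t write entries for this dictionary, you don’t have a delegable task yet; you have exploration.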

Four Failure Modes (and their fixes)

When agents appear unreliable, it’s usually a failure of the surrounding system design rather than the model itself. We see four common patterns.

  1. Undefined Specs

    You have intent, but no mechanism to verify it. The requirements are unsettled, or the definition of done is just a feeling, like “make onboarding feel simpler.”

    Fix: Don’t delegate the decision-making. Use the agent to prototype and explore, but treat the output as raw material that helps you write the spec, not as the final product. If you don’t know what done looks like, the agent won’t either.

  2. Hidden Context

    The constraints exist, but they’re trapped in a meeting note or a Slack thread. Unlike undefined specs, the spec exists here. It just isn’t where the agent can read it. Think of the edge-case permission rule that came up once in discussion but never made it into the repo.

    Fix: Treat context as code. If a constraint isn’t captured in versioned, linkable artifacts (AGENTS.md, RFCs/ADRs, schemas), it doesn’t exist for the agent.
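What “context as code” can look like in practice: a fragment of an AGENTS.md at the repo root that turns once-spoken constraints into versioned, linkable text. Every rule, file path, and document ID below is an invented example:

```markdown
# AGENTS.md (illustrative fragment)

## Permissions (decided in RFC-042)
- A tenant admin must never be able to escalate to platform admin.
  Enforced in `auth/policy.py`; covered by `tests/test_escalation.py`.

## Non-negotiables
- Schema changes go through `migrations/` only; never alter tables directly.
- CI must be green before merge. Red CI is a hard stop, not a warning.
```

The value isn’t the prose; it’s that each constraint now lives where the agent reads, in version control, next to the code it governs.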

  3. Unenforced Verification

    The specs exist and are accessible, but the agent isn’t forced to check them. Tests are nice to have. CI failures don’t block merges. The system rewards speed or volume over correctness. The result is a workflow where “it probably works” is treated as progress.

    Fix: Verification must be a termination condition. CI gates must fail closed. Pre-commit hooks tighten the loop. If the tests don’t pass, the agent hasn’t finished.
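“Fail closed” can be made precise: any gate outcome other than an explicit pass, including a crashed or skipped verifier, blocks the merge. A minimal sketch; the enum and `may_merge` helper are assumptions, not any CI system’s API:

```python
from enum import Enum

class GateResult(Enum):
    PASS = "pass"
    FAIL = "fail"
    ERROR = "error"  # verifier crashed, timed out, or never ran

def may_merge(results: list) -> bool:
    """Fail closed: every gate must report an explicit PASS.

    FAIL blocks, ERROR blocks, and an empty result list blocks too;
    "it probably works" never counts as progress.
    """
    return bool(results) and all(r is GateResult.PASS for r in results)
```

The asymmetry is the point: success requires positive evidence from every gate, while failure requires nothing at all.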

  4. Inadequate Constraints

    The agent is verifying, but the constraints are too weak or too game-able. Tests pass, yet the system is still wrong: coverage is thin, assertions encode the wrong intent, or non-functional requirements (performance, security, UX) aren’t represented. For example, unit tests stay green while latency quietly doubles.

    Fix: Widen the constraint surface. Add invariants and golden tests for critical flows, static analysis (types, linters), and where it matters, property tests/fuzzing. For production-adjacent changes, pair verification with observability and rollback criteria.
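One way to widen the constraint surface beyond hand-picked examples is to assert invariants over many generated inputs. A property-testing library such as Hypothesis does this properly; the sketch below uses only the standard library, and `apply_discount` is a hypothetical function under test:

```python
import random

def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test."""
    return price_cents - (price_cents * percent) // 100

def check_discount_invariants(trials: int = 1000) -> None:
    rng = random.Random(0)  # seeded so failures are reproducible
    for _ in range(trials):
        price = rng.randint(0, 1_000_000)
        percent = rng.randint(0, 100)
        result = apply_discount(price, percent)
        # Invariants: a discount never raises the price and never goes negative.
        assert 0 <= result <= price, (price, percent, result)

check_discount_invariants()
```

A thousand random cases won’t prove correctness, but they encode intent (“never negative, never higher than the original”) that a handful of example-based tests can quietly miss.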

Deciding What to Delegate

As models get smarter and faster, and context windows expand, the temptation is to throw them at larger, fuzzier problems. But a smarter, faster agent in a fuzzy environment mostly produces the wrong thing faster. It cannot know what it cannot read.

To decide whether to delegate, ask:

  • Can the agent verify success on its own?
  • How much drift can you tolerate if it gets the answer slightly wrong?

That gives four cases:

  1. Verifiable, high tolerance. Let it run and spot-check. Examples: generating release notes from merged PRs; drafting meeting notes.
  2. Verifiable, low tolerance. Delegate with gates. Examples: fixing a failing unit test; fixing a customer-reported bug.
  3. Not yet verifiable, high tolerance. Use the agent for exploration, then extract constraints from what you learn. Examples: exploring UI layouts for a new feature; brainstorming marketing copy.
  4. Not yet verifiable, low tolerance. Don’t delegate the decision yet. First use the agent to produce the artifacts that make the work verifiable. Examples: draft a permission matrix, define invariants, write escalation-path tests, prototype policy-as-code.
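The two questions form a two-by-two decision table, which can be written down directly. A sketch, with recommendations worded to mirror the four cases above:

```python
def delegation_mode(verifiable: bool, high_tolerance: bool) -> str:
    """Map the two delegation questions onto one of the four cases."""
    if verifiable and high_tolerance:
        return "let it run and spot-check"
    if verifiable:
        return "delegate with gates"
    if high_tolerance:
        return "explore, then extract constraints"
    return "don't delegate the decision; make the work verifiable first"
```
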

High-level work (architecture, strategy, trade-offs) often starts in the hardest case: low verifiability, low tolerance. Assumptions hide best there. But high-level work is usually decomposable. Break it into constrained subtasks, then delegate those.

For example, “defining multi-tenant permissions” is low tolerance and low verifiability at the start. Don’t delegate the decision; delegate the work of making it verifiable: draft a permission matrix and invariants, write tests for escalation paths, prototype a policy-as-code layer. Once those constraints exist, implementation becomes a low-tolerance but verifiable task.

The Acceleration of Debt

None of this is new engineering wisdom. What changes with agents is the rate at which small omissions compound.

Humans bridge gaps socially: they ask questions, notice contradictions, remember “that one incident from last year,” and hesitate when something feels off. Agents don’t get those dampeners. They will happily produce plausible work until the system forces contact with reality.

That’s why what used to be technical debt becomes context failure. If a constraint (decision records, schemas, invariants, style guides, runbooks) isn’t captured in the repo, it effectively doesn’t exist for an agent. Treat context as code, and treat verification as the termination condition, not a suggestion.

The future workflow isn’t exotic. It’s the old best practices, made load-bearing by speed. Start with the last thing your agent got wrong. Turn it into a constraint or a check. Wire it into the loop, then repeat.