Consider two recent experiments with coding agents. Similar ambition. Opposite outcomes.
In the first, a team pointed hundreds of agents at a browser project. In a week they produced roughly a million lines of code. By coordination metrics it was a success: parallel work, lots of merged PRs, visible throughput. But when the project went public, outside observers pointed to failing CI and questioned how much of that visible throughput translated into a clean, working system.
In the second, as described in Claude is not a senior engineer (yet), a single engineer connected Claude to an automated browser testing suite (Playwright) and an error monitoring tool (Sentry). The agent wrote code, ran the tests, read the error traces, and fixed its own bugs. Ninety minutes later, it worked.
I am not trying to offer a definitive postmortem on either case. I am using them as contrasting examples of a broader engineering pattern.
The pattern is a standard engineering concept: tolerance.
Tolerance: How much drift can you afford?
Mechanical engineering abandoned binary “works/doesn’t work” thinking decades ago. A bridge doesn’t just “work”. It tolerates a specific load variance under specific conditions. We ask about the allowable margin of error: the acceptable band around the ideal.
Software engineering has tolerances, too.
Tolerance isn’t one number. In software it decomposes into dimensions like correctness, security, latency, cost, reversibility, and blast radius (think error budgets). UI copy may be flexible on exact wording but not on brand tone. A refactor may tolerate new implementation details but not behavior changes.
Some tasks are high tolerance: exploratory prototyping, quick internal tools, one-off scripts, early drafts. Drift is acceptable because the goal is discovery and speed.
Other tasks are low tolerance: production infrastructure, security boundaries, customer-facing behavior, billing and permissions. Here “close enough” isn’t a solution. It’s a failure that may not show up immediately, but will surface later as incidents and churn.
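One way to make that concrete is to treat a task’s tolerance as a set of per-dimension error budgets rather than a single pass/fail bit. A minimal sketch; the dimension names and thresholds here are invented for illustration:

```python
# Tolerance as per-dimension error budgets (illustrative numbers).
BUDGETS = {
    "failing_tests": 0,      # correctness: zero tolerance
    "p95_latency_ms": 250,   # latency: some slack allowed
    "cost_usd": 0.50,        # cost per task
}

def within_tolerance(observed, budgets=BUDGETS):
    """Return the dimensions that drifted past their budget."""
    return [dim for dim, limit in budgets.items()
            if observed.get(dim, 0) > limit]
```

A low-tolerance task fails if any dimension comes back over budget; a high-tolerance one might only care about `failing_tests`. For example, `within_tolerance({"failing_tests": 0, "p95_latency_ms": 300})` flags only the latency drift.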
The difference between those two agent experiments was that one treated a low-tolerance problem with a high-tolerance process.
That mismatch shows up as a control problem.
Open Loops vs. Closed Loops
In an open loop, the agent writes code and opens pull requests, but verification bottlenecks at human review that happens minutes, hours, or days later. The delay between action and verification lets error accumulate. Drift becomes visible only after it’s expensive.
In a closed loop, the agent makes a change and immediately runs verification against hard constraints. The loop itself damps error. Closed loops require fast, reliable feedback; slow or flaky verification re-opens the loop.
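In code, the closed-loop shape is just a loop whose exit condition is verification. A sketch, with `propose_change` and `verify` as hypothetical stand-ins for the agent and the test harness:

```python
def closed_loop(propose_change, verify, max_turns=10):
    """Run a change-verify cycle until hard constraints pass.

    `propose_change` acts on the last failure output (the agent's turn);
    `verify` runs the real checks (tests, error traces) and returns
    (ok, feedback). Both are placeholders for this sketch.
    """
    feedback = ""
    for turn in range(1, max_turns + 1):
        propose_change(feedback)   # agent edits, guided by last failure
        ok, feedback = verify()    # verification every turn, not at review
        if ok:
            return turn            # termination = constraints satisfied
    raise RuntimeError("turn budget exhausted; escalate to a human")
```

Note the termination condition: the loop does not end when output exists, it ends when verification passes, and a flaky or missing `verify` silently re-opens the loop.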
The Two Experiments Compared
| | Browser Project (Open Loop) | LLM + Tests + Traces (Closed Loop) |
|---|---|---|
| Feedback signal | Informational (throughput, mergeability, activity) | Structural (tests pass, errors resolved) |
| Verification timing | After the fact, by humans | Every turn, by the agent |
| Termination condition | PR merged | Constraints satisfied |
| Outcome | Implied success; required human fixes | Working fix in 90 minutes |
Read this way, the browser project optimized for coordination metrics, while the Claude setup optimized for correctness under feedback. The point is not that many agents are inherently bad. The point is that open loops amplify drift when verification is weak. Agent reliability isn’t magic model behavior; it’s an environment where correctness is continuously verified.
But verification has a precondition: success must be expressible as constraints the agent can actually verify.
For a picture of what the browser experiment could have looked like with human coordination and a tight loop, see One Human + One Agent = One Browser From Scratch.
The Ambiguity Gap
The fundamental challenge with agents is that intent is latent (in your head), while evidence is explicit (text documents in the repo).
Ambiguity is the distance between intent and evidence.
When you delegate to a human engineer, they bridge that gap with judgment: they ask clarifying questions, infer missing context, notice anomalies, and sanity-check against domain knowledge.
When you delegate to an AI agent, it can’t feel that gap. It needs measurable constraints to know whether it has actually crossed from “plausible output” to “correct result.”
Without constraints, the agent will still do as instructed. It will just optimize for the easiest proxy it can satisfy: producing output, closing tickets, merging PRs. Not correctness or integration.
A practical predictor of these outcomes is:
Can you describe success in terms of constraints the agent can verify?
If you can, agents compound your effort. If you can’t, you’re doing exploration, and you should treat the output as exploration.
Three Failure Modes (and their fixes)
When agents appear unreliable, it’s usually a failure of the surrounding system design rather than the model itself. We see three common patterns.
Undefined Specs
You have intent, but no mechanism to verify it. The requirements are unsettled, or the definition of done is just a feeling, like “make onboarding feel simpler.”
Fix: Don’t delegate the decision-making. Use the agent to prototype and explore, but treat the output as raw material that helps you write the spec, not as the final product. If you don’t know what done looks like, the agent won’t either.
Unenforced Verification
The specs exist and are accessible, but the agent isn’t forced to check them. Tests are nice to have. CI failures don’t block merges. The system rewards speed or volume over correctness. The result is a workflow where “it probably works” is treated as progress.
Fix: Verification must be a termination condition. CI gates must fail closed. Pre-commit hooks tighten the loop. If the tests don’t pass, the agent hasn’t finished.
Inadequate Constraints
The agent is verifying, but the constraints are too weak or too game-able. Tests pass, yet the system is still wrong: coverage is thin, assertions encode the wrong intent, or non-functional requirements (performance, security, UX) aren’t represented. For example, unit tests stay green while latency quietly doubles.
Fix: Widen the constraint surface. Add invariants and golden tests for critical flows, static analysis (types, linters), and where it matters, property tests/fuzzing. For production-adjacent changes, pair verification with observability and rollback criteria.
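Property-style checks are one cheap way to widen the surface without a framework: generate random inputs and assert invariants that must always hold. A hand-rolled sketch against a hypothetical `slugify`:

```python
import random
import string

def slugify(text):
    """Hypothetical function under test: lowercase, keep only
    alphanumerics and hyphens, join words with hyphens."""
    return "-".join(
        "".join(c for c in word if c.isalnum() or c == "-")
        for word in text.lower().split()
    )

def check_properties(fn, trials=500):
    """Randomized invariants: a wider net than example-based tests."""
    for _ in range(trials):
        s = "".join(random.choices(string.printable,
                                   k=random.randint(0, 40)))
        out = fn(s)
        assert out == out.lower()   # never emits uppercase
        assert " " not in out       # whitespace never survives
        assert fn(out) == out       # idempotent: safe to re-run
```

The same idea extends to non-functional constraints, such as asserting a latency budget over a batch of calls, so a green unit suite can’t hide a quiet doubling of latency.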
Deciding What to Delegate
As models get smarter and faster, and context windows expand, the temptation is to throw them at larger, fuzzier problems. But a smarter, faster agent in a fuzzy environment mostly produces the wrong thing faster. It cannot know what it cannot read.
To decide whether to delegate, ask:
- Can the agent verify success on its own?
- How much drift can you tolerate if it gets the answer slightly wrong?
That gives four cases:
- Verifiable, high tolerance. Let it run and spot-check. Examples: generating release notes from merged PRs; drafting meeting notes.
- Verifiable, low tolerance. Delegate with gates. Examples: fixing a failing unit test; fixing a customer-reported bug.
- Not yet verifiable, high tolerance. Use the agent for exploration, then extract constraints from what you learn. Examples: exploring UI layouts for a new feature; brainstorming marketing copy.
- Not yet verifiable, low tolerance. Don’t delegate the decision yet. First use the agent to produce the artifacts that make the work verifiable. Examples: draft a permission matrix, define invariants, write escalation-path tests, prototype policy-as-code.
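Collapsed into code, that 2x2 is a four-line lookup; the returned strings are shorthand for the cases above:

```python
def delegation_mode(verifiable: bool, high_tolerance: bool) -> str:
    """Map the two delegation questions onto the four cases."""
    if verifiable and high_tolerance:
        return "let it run, spot-check"
    if verifiable:
        return "delegate with gates"
    if high_tolerance:
        return "explore, then extract constraints"
    return "don't delegate the decision; make it verifiable first"
```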
High-level work (architecture, strategy, trade-offs) often starts in the hardest case: low verifiability, low tolerance. Assumptions hide best there. But high-level work is usually decomposable. Break it into constrained subtasks, then delegate those.
For example, “defining multi-tenant permissions” is low tolerance and low verifiability at the start. Don’t delegate the decision; delegate the work of making it verifiable: draft a permission matrix and invariants, write tests for escalation paths, prototype a policy-as-code layer. Once those constraints exist, implementation becomes a low-tolerance but verifiable task.
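Here is what “delegate the work of making it verifiable” might produce, with every role and action name hypothetical: a permission matrix encoded as data, plus invariants an agent can re-check on every change:

```python
# Hypothetical permission matrix: role -> allowed actions. Encoding
# the decision as data is what makes implementation verifiable later.
MATRIX = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "admin":  {"read", "write", "delete", "invite"},
}

def allowed(role, action):
    return action in MATRIX.get(role, set())  # unknown roles fail closed

def check_invariants():
    """Invariants for escalation paths, checkable on every change."""
    assert MATRIX["viewer"] <= MATRIX["editor"] <= MATRIX["admin"], \
        "each role must include the permissions of the role below it"
    assert not allowed("viewer", "delete")
    assert not allowed("nonexistent-role", "read")
```

Once these constraints exist in the repo, the implementation task drops into the “verifiable, low tolerance” quadrant and can be delegated with gates.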
The Acceleration of Debt
None of this is new engineering wisdom. What changes with agents is the rate at which small omissions compound.
Humans bridge gaps socially: they ask questions, notice contradictions, remember “that one incident from last year,” and hesitate when something feels off. Agents don’t get those dampeners. They will happily produce plausible work until the system forces contact with reality.
That’s why what used to be technical debt becomes context failure. If a constraint (decision records, schemas, invariants, style guides, runbooks) isn’t captured in the repo, it effectively doesn’t exist for an agent. Treat context as code, and treat verification as the termination condition, not a suggestion.
The future workflow isn’t exotic. It’s the old best practices, made load-bearing by speed. Start with the last thing your agent got wrong. Turn it into a constraint or a check. Wire it into the loop, then repeat.