{
  "version": "https://jsonfeed.org/version/1.1",
  "title": "Flexcompute Engineering",
  "home_page_url": "https://engineering.flexcompute.com",
  "feed_url": "https://engineering.flexcompute.com/feed.json",
  "description": "Essays, tutorials, and case studies on AI engineering, computational physics, photonics, and simulation from Flexcompute.",
  "items": [
    {
      "id": "https://engineering.flexcompute.com/articles/photonic-inverse-design-45-lines/",
      "url": "https://engineering.flexcompute.com/articles/photonic-inverse-design-45-lines/",
      "title": "Designing a Photonic Chip Component with ~45 Lines of Python",
      "summary": "A compact introduction to photonic inverse design with Tidy3D, using a pre-built simulation and a ~45-line optimization loop.",
      "image": "https://engineering.flexcompute.com/images/og/photonic-inverse-design-45-lines.png",
      "banner_image": "https://engineering.flexcompute.com/images/og/photonic-inverse-design-45-lines.png",
      "date_published": "2026-03-05T00:00:00.000Z",
      "date_modified": "2026-03-05T00:00:00.000Z",
      "authors": [
        {
          "name": "Tyler Hughes",
          "path": "/authors/tyler-hughes/",
          "url": "https://engineering.flexcompute.com/authors/tyler-hughes/"
        }
      ],
      "tags": [
        "Photonics",
        "Inverse Design",
        "Optimization",
        "Tidy3D"
      ],
      "content_html": "<p><strong>Photonic chips</strong> guide light through tiny waveguides etched into silicon, much like electrical wires carry current on a circuit board. These chips are increasingly important for high-speed data links, sensing, and quantum computing. Routing light around corners is surprisingly hard: a smooth 90-degree waveguide bend often needs a radius of several microns to keep loss low, and that takes up valuable chip real estate. What if a computer could design a structure that makes the bend in a fraction of the space?</p>\n<p>That's the idea behind <strong>inverse design</strong>: instead of designing a device and checking if it works, you specify <em>what</em> you want and let an algorithm figure out the geometry, pixel by pixel, using the same gradient-based methods that train neural networks.</p>\n<p>The idea isn't new. Structural engineers have used topology optimization to design bridges and aircraft parts since the 1980s, and <a href=\"https://link.springer.com/article/10.1007/s001580050176\">Sigmund's \"99 line topology optimization code\"</a> showed the core algorithm fits in a single MATLAB script. 
This post does the same for photonic inverse design: once the base simulation is given, the core optimization loop fits in <strong>~45 lines of Python</strong>.</p>\n<p>Let's build it.</p>\n<div class=\"article-overview\">\n  <p class=\"article-overview__eyebrow\">At a glance</p>\n  <div class=\"article-overview__grid\">\n    <section>\n      <h3>Goal</h3>\n      <p>Route 1.0 μm light around a 90-degree bend inside a 3x3 μm design region.</p>\n    </section>\n    <section>\n      <h3>Method</h3>\n      <p>Use Tidy3D's adjoint gradients to optimize a pixelized material layout with Adam.</p>\n    </section>\n    <section>\n      <h3>Outcome</h3>\n      <p>\n        In ten iterations, the design routes roughly 89% of the power into the desired output mode.\n      </p>\n    </section>\n  </div>\n</div>\n<h2>The Problem: Bending Light on a Chip</h2>\n<p>Light travels through a <strong>waveguide</strong>, a thin strip of high-refractive-index material (like silicon) surrounded by a lower-index material (like air). The light is confined to the strip by total internal reflection, similar to how fiber optics work.</p>\n<p>We want to route light at wavelength 1.0 μm around a <strong>90-degree corner</strong>. It enters horizontally from the left and must exit vertically downward. Between input and output sits a <strong>design region</strong>, a 3x3 μm square where the optimizer can freely place or remove material. The question is: <em>what pattern maximally routes the light from input to output?</em></p>\n<p>To simulate how light propagates through a given geometry, we use <a href=\"https://www.flexcompute.com/tidy3d/\">Tidy3D</a>, a cloud-based electromagnetic solver. Given a device geometry and material properties, Tidy3D solves Maxwell's equations and tells us where the light goes. 
Crucially, Tidy3D exposes an <a href=\"https://github.com/HIPS/autograd\">autograd</a>-based inverse-design workflow, which lets us compute gradients through the simulation (more on this in Step 3).</p>\n<p>The base simulation (waveguides, light source, output monitor, and absorbing boundary conditions) is pre-built and stored in <code>sim_base.yaml</code>. We load it and focus entirely on the optimization algorithm.</p>\n<figure class=\"article-figure article-figure--medium\">\n  <img\n    src=\"https://engineering.flexcompute.com/images/photonic-inverse-design/simulation-setup.png\"\n    alt=\"The simulation setup. Light enters from the left through a horizontal waveguide and should exit downward through a vertical waveguide. The dashed box is the design region where we will optimize the material layout.\"\n  />\n  <figcaption>\n    <strong>Simulation setup.</strong> Light enters from the left through a horizontal waveguide and\n    should exit downward through a vertical waveguide. The dashed box is the design region where we\n    will optimize the material layout.\n  </figcaption>\n</figure>\n<pre><code class=\"language-python\">import autograd\nimport autograd.numpy as np\nimport tidy3d as td\nimport tidy3d.web as web\nfrom tidy3d.plugins.autograd import make_filter_and_project\n\nsim_base = td.Simulation.from_file(\"sim_base.yaml\")\n</code></pre>\n<h2>Step 1: From Design Variables to Simulation</h2>\n<p>We need a function that maps a set of <strong>design variables</strong> to a complete electromagnetic simulation. 
Each pixel in the design region gets a variable <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>ρ</mi></mrow><annotation encoding=\"application/x-tex\">\\rho</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.625em;vertical-align:-0.1944em;\"></span><span class=\"mord mathnormal\">ρ</span></span></span></span>, a number between 0 and 1. We then convert <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>ρ</mi></mrow><annotation encoding=\"application/x-tex\">\\rho</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.625em;vertical-align:-0.1944em;\"></span><span class=\"mord mathnormal\">ρ</span></span></span></span> to a <strong>permittivity</strong> value. Permittivity is the square of the refractive index and controls how light interacts with the material. 
Our material has refractive index <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>n</mi><mo>=</mo><mn>2</mn></mrow><annotation encoding=\"application/x-tex\">n = 2</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.4306em;\"></span><span class=\"mord mathnormal\">n</span><span class=\"mspace\" style=\"margin-right:0.2778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.6444em;\"></span><span class=\"mord\">2</span></span></span></span>, so its permittivity is <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><msup><mi>n</mi><mn>2</mn></msup><mo>=</mo><mn>4</mn></mrow><annotation encoding=\"application/x-tex\">n^2 = 4</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8141em;\"></span><span class=\"mord\"><span class=\"mord mathnormal\">n</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.8141em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">2</span></span></span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.2778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.6444em;\"></span><span class=\"mord\">4</span></span></span></span>. 
Air has permittivity 1.</p>\n<p>But we don't map <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>ρ</mi></mrow><annotation encoding=\"application/x-tex\">\\rho</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.625em;vertical-align:-0.1944em;\"></span><span class=\"mord mathnormal\">ρ</span></span></span></span> to permittivity directly. Two transformations happen first:</p>\n<h3>Density filter</h3>\n<p>A <strong>convolutional filter</strong> with radius <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>R</mi></mrow><annotation encoding=\"application/x-tex\">R</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.6833em;\"></span><span class=\"mord mathnormal\" style=\"margin-right:0.00773em;\">R</span></span></span></span> blurs each pixel's value with its neighbors. This is important because these devices will eventually be fabricated, and real manufacturing processes have a <strong>minimum feature size</strong> they can reliably produce. The filter suppresses features much smaller than <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>R</mi></mrow><annotation encoding=\"application/x-tex\">R</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.6833em;\"></span><span class=\"mord mathnormal\" style=\"margin-right:0.00773em;\">R</span></span></span></span>, acting as a simple proxy for more sophisticated fabrication-aware design checks rather than a strict guarantee. 
We use <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>R</mi><mo>=</mo><mn>150</mn><mtext> </mtext><mtext>nm</mtext></mrow><annotation encoding=\"application/x-tex\">R = 150\\,\\text{nm}</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.6833em;\"></span><span class=\"mord mathnormal\" style=\"margin-right:0.00773em;\">R</span><span class=\"mspace\" style=\"margin-right:0.2778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.6444em;\"></span><span class=\"mord\">150</span><span class=\"mspace\" style=\"margin-right:0.1667em;\"></span><span class=\"mord text\"><span class=\"mord\">nm</span></span></span></span></span>.</p>\n<h3>Tanh projection</h3>\n<p>After filtering, a <strong>tanh function</strong> pushes the smoothed values toward 0 or 1, controlled by a sharpness parameter <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>β</mi></mrow><annotation encoding=\"application/x-tex\">\\beta</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8889em;vertical-align:-0.1944em;\"></span><span class=\"mord mathnormal\" style=\"margin-right:0.05278em;\">β</span></span></span></span>. 
At low <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>β</mi></mrow><annotation encoding=\"application/x-tex\">\\beta</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8889em;vertical-align:-0.1944em;\"></span><span class=\"mord mathnormal\" style=\"margin-right:0.05278em;\">β</span></span></span></span>, the mapping is nearly linear, so the optimizer can explore intermediate values freely. At high <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>β</mi></mrow><annotation encoding=\"application/x-tex\">\\beta</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8889em;vertical-align:-0.1944em;\"></span><span class=\"mord mathnormal\" style=\"margin-right:0.05278em;\">β</span></span></span></span>, it becomes a hard threshold that forces every pixel to be pure material or pure air.</p>\n<p>The figure below shows the effect of these two transformations applied to random noise. Filtering smooths out fine features; projection pushes values toward binary. When applied during optimization, they guide the optimizer toward clean, fabricable geometries.</p>\n<figure class=\"article-figure article-figure--medium\">\n  <img\n    src=\"https://engineering.flexcompute.com/images/photonic-inverse-design/filter-projection.png\"\n    alt=\"Effect of density filter (columns) and tanh projection (rows) applied to random noise. Without filtering (left), arbitrarily small features remain. The filter with R = 150 nm enforces a minimum feature size (right). Without projection (top), intermediate values persist. 
High beta projection pushes toward binary material and air (bottom).\"\n    loading=\"lazy\"\n  />\n  <figcaption>\n    <strong>Filtering and projection.</strong> Filtering suppresses arbitrarily small features,\n    while increasing beta pushes the design toward a binary material-air pattern. We use both\n    together during optimization.\n  </figcaption>\n</figure>\n<pre><code class=\"language-python\">n_mat = 2.0                    # material refractive index\neps_mat = n_mat ** 2           # permittivity = n^2 = 4.0\ndesign_size = 3.0              # design region side length (um)\npixel_size = 1.0 / 50          # pixel resolution (um)\nradius = 0.150                 # filter radius R (um), sets minimum feature size\nnx = ny = int(design_size / pixel_size)\n\ndesign_region_geo = td.Box(center=(0, 0, 0), size=(design_size, design_size, td.inf))\n\nfilter_project = make_filter_and_project(radius=radius, dl=pixel_size)\n</code></pre>\n<p>Now we construct the function that takes our design parameters <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>ρ</mi></mrow><annotation encoding=\"application/x-tex\">\\rho</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.625em;vertical-align:-0.1944em;\"></span><span class=\"mord mathnormal\">ρ</span></span></span></span>, applies the filter and projection to get a permittivity map, builds a structure from it, and adds it to the base simulation we loaded from file.</p>\n<pre><code class=\"language-python\">def make_sim(params, beta):\n    \"\"\"Map design variables through filter, projection, and into a simulation.\"\"\"\n    density = filter_project(params, beta=beta)\n    eps_data = 1.0 + (eps_mat - 1.0) * density\n    structure = td.Structure.from_permittivity_array(\n        eps_data=eps_data, geometry=design_region_geo,\n    )\n    return 
sim_base.updated_copy(\n        structures=list(sim_base.structures) + [structure],\n    )\n</code></pre>\n<h2>Step 2: Objective Function</h2>\n<p>We need a single number that tells us how well the device works. At the output waveguide, Tidy3D measures the <strong>mode amplitude</strong> <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>a</mi></mrow><annotation encoding=\"application/x-tex\">a</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.4306em;\"></span><span class=\"mord mathnormal\">a</span></span></span></span>, a complex number describing how much light couples into the waveguide's guided mode. The power carried by that mode is <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi mathvariant=\"normal\">∣</mi><mi>a</mi><msup><mi mathvariant=\"normal\">∣</mi><mn>2</mn></msup></mrow><annotation encoding=\"application/x-tex\">|a|^2</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1.0641em;vertical-align:-0.25em;\"></span><span class=\"mord\">∣</span><span class=\"mord mathnormal\">a</span><span class=\"mord\"><span class=\"mord\">∣</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.8141em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">2</span></span></span></span></span></span></span></span></span></span></span>, so our figure of merit is simply the output mode power:</p>\n<span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\" 
display=\"block\"><semantics><mrow><mrow><mi mathvariant=\"normal\">F</mi><mi mathvariant=\"normal\">O</mi><mi mathvariant=\"normal\">M</mi></mrow><mo>=</mo><mi mathvariant=\"normal\">∣</mi><mi>a</mi><msup><mi mathvariant=\"normal\">∣</mi><mn>2</mn></msup></mrow><annotation encoding=\"application/x-tex\">\\mathrm{FOM} = |a|^2</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.6833em;\"></span><span class=\"mord\"><span class=\"mord mathrm\">FOM</span></span><span class=\"mspace\" style=\"margin-right:0.2778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1.1141em;vertical-align:-0.25em;\"></span><span class=\"mord\">∣</span><span class=\"mord mathnormal\">a</span><span class=\"mord\"><span class=\"mord\">∣</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.8641em;\"><span style=\"top:-3.113em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">2</span></span></span></span></span></span></span></span></span></span></span></span>\n<p>A perfect device would have <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mrow><mi mathvariant=\"normal\">F</mi><mi mathvariant=\"normal\">O</mi><mi mathvariant=\"normal\">M</mi></mrow><mo>=</mo><mn>1</mn></mrow><annotation encoding=\"application/x-tex\">\\mathrm{FOM} = 1</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.6833em;\"></span><span class=\"mord\"><span class=\"mord mathrm\">FOM</span></span><span class=\"mspace\" style=\"margin-right:0.2778em;\"></span><span 
class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.6444em;\"></span><span class=\"mord\">1</span></span></span></span> (all input power reaching the output). In code: we build a simulation from our design variables, run it on the cloud, extract the mode amplitude, and return the power.</p>\n<pre><code class=\"language-python\">def objective(params, beta):\n    \"\"\"Run electromagnetic simulation and return output mode power.\"\"\"\n    sim = make_sim(params, beta)\n    data = web.run(sim, task_name=\"invdes\", verbose=False)\n    amps = data[\"mode\"].amps.sel(direction=\"-\", mode_index=0).values\n    return np.sum(np.abs(amps) ** 2)\n</code></pre>\n<h2>Step 3: Gradients via the Adjoint Method</h2>\n<p>To optimize, we need the gradient <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>d</mi><mrow><mi mathvariant=\"normal\">F</mi><mi mathvariant=\"normal\">O</mi><mi mathvariant=\"normal\">M</mi></mrow><mi mathvariant=\"normal\">/</mi><mi>d</mi><mi>ρ</mi></mrow><annotation encoding=\"application/x-tex\">d\\mathrm{FOM}/d\\rho</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathnormal\">d</span><span class=\"mord\"><span class=\"mord mathrm\">FOM</span></span><span class=\"mord\">/</span><span class=\"mord mathnormal\">d</span><span class=\"mord mathnormal\">ρ</span></span></span></span> for every pixel: how does tweaking the <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>ρ</mi></mrow><annotation encoding=\"application/x-tex\">\\rho</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" 
style=\"height:0.625em;vertical-align:-0.1944em;\"></span><span class=\"mord mathnormal\">ρ</span></span></span></span> value of each pixel affect the output power? The brute-force approach would perturb each pixel one at a time and re-simulate. For our 150x150 grid, that's <strong>22,500 simulations</strong> per optimization step. Completely impractical.</p>\n<p>The <strong>adjoint method</strong> computes the <em>exact same gradient</em> using just <strong>two simulations</strong>, regardless of how many pixels there are:</p>\n<ol>\n<li><strong>Forward simulation</strong>: run the device normally, injecting light at the input and recording the electric field everywhere. This is the simulation we'd run anyway to evaluate the design.</li>\n<li><strong>Adjoint simulation</strong>: inject a special source <em>at the output monitor</em> that encodes the derivative of our objective function. This tells the simulation \"how much does the objective change if the field here changes?\" The resulting adjoint fields propagate backward through the device.</li>\n</ol>\n<p>After both simulations, the gradient at each pixel is simply the <strong>overlap of the forward and adjoint electric fields</strong>:</p>\n<span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\" display=\"block\"><semantics><mrow><mfrac><mrow><mi mathvariant=\"normal\">∂</mi><mi>F</mi></mrow><mrow><mi mathvariant=\"normal\">∂</mi><msub><mi>ε</mi><mi>i</mi></msub></mrow></mfrac><mo>∝</mo><mi mathvariant=\"normal\">Re</mi><mo>⁡</mo><mrow><mo fence=\"true\">(</mo><msub><mi>E</mi><mi>i</mi></msub><mo>⋅</mo><msubsup><mi>E</mi><mi>i</mi><mrow><mi mathvariant=\"normal\">a</mi><mi mathvariant=\"normal\">d</mi><mi mathvariant=\"normal\">j</mi></mrow></msubsup><mo fence=\"true\">)</mo></mrow></mrow><annotation encoding=\"application/x-tex\">\\frac{\\partial F}{\\partial \\varepsilon_i} \\propto \\operatorname{Re}\\left(E_i \\cdot 
E_i^{\\mathrm{adj}}\\right)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:2.2074em;vertical-align:-0.836em;\"></span><span class=\"mord\"><span class=\"mopen nulldelimiter\"></span><span class=\"mfrac\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:1.3714em;\"><span style=\"top:-2.314em;\"><span class=\"pstrut\" style=\"height:3em;\"></span><span class=\"mord\"><span class=\"mord\" style=\"margin-right:0.05556em;\">∂</span><span class=\"mord\"><span class=\"mord mathnormal\">ε</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.3117em;\"><span style=\"top:-2.55em;margin-left:0em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i</span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span></span></span><span style=\"top:-3.23em;\"><span class=\"pstrut\" style=\"height:3em;\"></span><span class=\"frac-line\" style=\"border-bottom-width:0.04em;\"></span></span><span style=\"top:-3.677em;\"><span class=\"pstrut\" style=\"height:3em;\"></span><span class=\"mord\"><span class=\"mord\" style=\"margin-right:0.05556em;\">∂</span><span class=\"mord mathnormal\" style=\"margin-right:0.13889em;\">F</span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.836em;\"><span></span></span></span></span></span><span class=\"mclose nulldelimiter\"></span></span><span class=\"mspace\" style=\"margin-right:0.2778em;\"></span><span class=\"mrel\">∝</span><span class=\"mspace\" style=\"margin-right:0.2778em;\"></span></span><span class=\"base\"><span class=\"strut\" 
style=\"height:1.8em;vertical-align:-0.65em;\"></span><span class=\"mop\"><span class=\"mord mathrm\">Re</span></span><span class=\"mspace\" style=\"margin-right:0.1667em;\"></span><span class=\"minner\"><span class=\"mopen delimcenter\" style=\"top:0em;\"><span class=\"delimsizing size2\">(</span></span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right:0.05764em;\">E</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.3117em;\"><span style=\"top:-2.55em;margin-left:-0.0576em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i</span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.2222em;\"></span><span class=\"mbin\">⋅</span><span class=\"mspace\" style=\"margin-right:0.2222em;\"></span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right:0.05764em;\">E</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.967em;\"><span style=\"top:-2.4231em;margin-left:-0.0576em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i</span></span></span><span style=\"top:-3.1809em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\"><span class=\"mord mathrm mtight\">adj</span></span></span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.2769em;\"><span></span></span></span></span></span></span><span 
class=\"mclose delimcenter\" style=\"top:0em;\"><span class=\"delimsizing size2\">)</span></span></span></span></span></span></span>\n<p>Two simulations instead of 22,500. Intuitively, the forward field tells you \"how strongly does this pixel interact with the input light?\" and the adjoint field tells you \"how strongly does this pixel influence the output?\" Their product gives the sensitivity of the objective to each pixel.</p>\n<p>This is the same principle behind <a href=\"https://jingnanshi.com/blog/autodiff.html\"><strong>backpropagation</strong></a> in neural networks. The adjoint simulation is the <a href=\"https://jingnanshi.com/blog/autodiff.html\">vector-Jacobian product (VJP)</a> of the forward electromagnetic solve, and both exploit the chain rule to avoid redundant computation (see <a href=\"https://doi.org/10.1021/acsphotonics.0c00327\">Minkov et al., 2020</a> for a detailed treatment connecting adjoint methods and automatic differentiation in photonics).</p>\n<p>Tidy3D implements the adjoint math as the VJP of its electromagnetic solver, so this second simulation happens automatically behind the scenes. When we wrap our objective in <code>autograd.value_and_grad</code>, Tidy3D runs both the forward and adjoint simulations and backpropagates the gradient through the entire computational pipeline (simulation, mode decomposition, filter, projection, and all).</p>\n<pre><code class=\"language-python\">val_and_grad = autograd.value_and_grad(objective)\n# val_and_grad is a function: given (params, beta), it returns (fom, gradient)\n# e.g. fom, grad = val_and_grad(params, beta=10)\n</code></pre>\n<h2>Step 4: Optimize</h2>\n<p>With cheap gradients in hand, we can use any gradient-based optimizer. The <code>autograd</code> library provides <strong>Adam</strong> out of the box. 
Adam is the same optimizer that trains most neural networks: it maintains running averages of the gradient (momentum) and its square (adaptive learning rate), giving more stable convergence than plain gradient ascent.</p>\n<p>Adam's <code>grad</code> function takes <code>(params, iteration)</code>. We use the iteration number to <strong>gradually increase <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>β</mi></mrow><annotation encoding=\"application/x-tex\">\\beta</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8889em;vertical-align:-0.1944em;\"></span><span class=\"mord mathnormal\" style=\"margin-right:0.05278em;\">β</span></span></span></span></strong>: early on, a low <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>β</mi></mrow><annotation encoding=\"application/x-tex\">\\beta</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8889em;vertical-align:-0.1944em;\"></span><span class=\"mord mathnormal\" style=\"margin-right:0.05278em;\">β</span></span></span></span> keeps the design continuous, giving the optimizer freedom to explore many possible solutions; as the design matures, we ramp <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>β</mi></mrow><annotation encoding=\"application/x-tex\">\\beta</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8889em;vertical-align:-0.1944em;\"></span><span class=\"mord mathnormal\" style=\"margin-right:0.05278em;\">β</span></span></span></span> up to push toward a binary (material or air) structure that can actually be 
fabricated. We also negate the gradient so that Adam <em>maximizes</em> our objective instead of minimizing.</p>\n<pre><code class=\"language-python\">from autograd.misc.optimizers import adam\n\nn_steps = 10\nparams0 = 0.5 * np.ones((nx, ny, 1))\nhistory, param_history = [], [np.array(params0)]\n\ndef neg_grad(params, i):\n    \"\"\"Negative gradient with projection schedule (Adam minimizes, we negate to maximize).\"\"\"\n    params = np.clip(params, 0, 1)\n    beta = 5 + 45 * i / max(n_steps - 1, 1)\n    fom, g = val_and_grad(params, beta)\n    history.append(float(fom))\n    param_history.append(np.array(params))\n    print(f\"  step {i:2d} | FOM = {fom:.4f} | beta = {beta:.1f}\")\n    return -g\n</code></pre>\n<h2>Run It</h2>\n<p>Starting from a uniform design (<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>ρ</mi><mo>=</mo><mn>0.5</mn></mrow><annotation encoding=\"application/x-tex\">\\rho = 0.5</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.625em;vertical-align:-0.1944em;\"></span><span class=\"mord mathnormal\">ρ</span><span class=\"mspace\" style=\"margin-right:0.2778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.6444em;\"></span><span class=\"mord\">0.5</span></span></span></span> everywhere, halfway between air and material), the optimizer discovers the structure from scratch. 
Each step runs two simulations on the cloud (forward + adjoint), computes the gradient across all 22,500 pixels, and updates the design.</p>\n<pre><code class=\"language-python\">params_opt = np.clip(adam(neg_grad, params0, num_iters=n_steps, step_size=0.3), 0, 1)\n</code></pre>\n<pre><code>  step  0 | FOM = 0.0022 | beta = 5.0\n  step  1 | FOM = 0.0531 | beta = 10.0\n  step  2 | FOM = 0.3975 | beta = 15.0\n  step  3 | FOM = 0.3932 | beta = 20.0\n  step  4 | FOM = 0.6367 | beta = 25.0\n  step  5 | FOM = 0.7139 | beta = 30.0\n  step  6 | FOM = 0.7795 | beta = 35.0\n  step  7 | FOM = 0.8487 | beta = 40.0\n  step  8 | FOM = 0.8861 | beta = 45.0\n  step  9 | FOM = 0.8856 | beta = 50.0\n</code></pre>\n<h2>Results</h2>\n<figure class=\"article-figure article-figure--results\">\n  <img\n    class=\"article-figure__desktop-image\"\n    src=\"https://engineering.flexcompute.com/images/photonic-inverse-design/optimized-result.png\"\n    alt=\"Left: the optimized permittivity pattern. Right: the electromagnetic field intensity showing light bending from horizontal input to vertical output.\"\n    loading=\"lazy\"\n  />\n  <div class=\"article-figure__mobile-stack\">\n    <img\n      src=\"https://engineering.flexcompute.com/images/photonic-inverse-design/optimized-design-panel.png\"\n      alt=\"Optimized permittivity pattern showing the discovered material layout inside the waveguide bend.\"\n      loading=\"lazy\"\n    />\n    <img\n      src=\"https://engineering.flexcompute.com/images/photonic-inverse-design/optimized-field-panel.png\"\n      alt=\"Field intensity plot showing light bending from the horizontal input into the vertical output waveguide.\"\n      loading=\"lazy\"\n    />\n  </div>\n  <figcaption>\n    <strong>Final device and field response.</strong> The optimized permittivity pattern comes\n    first, with dark regions showing the high-index material and the light background showing air.\n    The corresponding field intensity, |E|^2, shows light entering 
from the left and bending\n    downward into the output waveguide.\n  </figcaption>\n</figure>\n<p>The final design looks nothing like what a human engineer would draw. There's no smooth curve, no gradual taper. Instead, the optimizer found a pattern of material and air that manipulates the electromagnetic field through interference to route the light around the corner.</p>\n<h3>Design evolution</h3>\n<p>Watch the design emerge from a uniform gray starting point. Early steps explore broadly (low <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>β</mi></mrow><annotation encoding=\"application/x-tex\">\\beta</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8889em;vertical-align:-0.1944em;\"></span><span class=\"mord mathnormal\" style=\"margin-right:0.05278em;\">β</span></span></span></span>, soft features); later steps sharpen into a clean binary design (high <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>β</mi></mrow><annotation encoding=\"application/x-tex\">\\beta</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8889em;vertical-align:-0.1944em;\"></span><span class=\"mord mathnormal\" style=\"margin-right:0.05278em;\">β</span></span></span></span>).</p>\n<figure class=\"article-figure article-figure--compact\">\n  <img\n    src=\"https://engineering.flexcompute.com/images/photonic-inverse-design/evolution.gif\"\n    alt=\"Animated evolution of the design over the optimization run, starting from a uniform gray design and sharpening into the final binary structure.\"\n    loading=\"lazy\"\n  />\n  <figcaption>\n    <strong>Optimization trajectory.</strong> The design starts from a uniform gray initialization,\n    develops 
structure quickly in the early low-beta steps, and then sharpens into a binary pattern\n    as the projection becomes steeper.\n  </figcaption>\n</figure>\n<h3>Convergence</h3>\n<p>The animation shows how the geometry sharpens visually. The convergence trace shows the same story numerically: rapid early gains, then diminishing returns as the design binarizes.</p>\n<figure class=\"article-figure article-figure--compact\">\n  <img\n    src=\"https://engineering.flexcompute.com/images/photonic-inverse-design/convergence.png\"\n    alt=\"The FOM (output mode power) increases rapidly in the first few steps as the optimizer discovers the basic routing structure, then stabilizes as beta increases and the design binarizes.\"\n    loading=\"lazy\"\n  />\n  <figcaption>\n    <strong>Convergence.</strong> The output-mode power rises quickly once the optimizer finds the\n    basic routing pattern, then levels off as the design binarizes.\n  </figcaption>\n</figure>\n<h3>The Complete Pipeline</h3>\n<p>Zooming out, each optimization step does the same five things in order:</p>\n<div class=\"article-process\">\n  <section class=\"article-process__step\">\n    <p class=\"article-process__index\">01</p>\n    <div>\n      <h3>Filter and project</h3>\n      <p>Map the raw design variables into a smooth, increasingly binary material distribution.</p>\n    </div>\n  </section>\n  <section class=\"article-process__step\">\n    <p class=\"article-process__index\">02</p>\n    <div>\n      <h3>Build the simulation</h3>\n      <p>Insert that material layout into the pre-built Tidy3D model of the waveguide bend.</p>\n    </div>\n  </section>\n  <section class=\"article-process__step\">\n    <p class=\"article-process__index\">03</p>\n    <div>\n      <h3>Run the forward solve</h3>\n      <p>Compute the output-mode power, which becomes the figure of merit.</p>\n    </div>\n  </section>\n  <section class=\"article-process__step\">\n    <p class=\"article-process__index\">04</p>\n    <div>\n    
  <h3>Run the adjoint solve</h3>\n      <p>Backpropagate sensitivity information through the electromagnetic simulation.</p>\n    </div>\n  </section>\n  <section class=\"article-process__step\">\n    <p class=\"article-process__index\">05</p>\n    <div>\n      <h3>Update the design</h3>\n      <p>Use Adam to take a gradient step, then repeat with a slightly sharper projection.</p>\n    </div>\n  </section>\n</div>\n<p>Each iteration costs two electromagnetic simulations. The adjoint method makes this feasible. Without it, we'd need 22,500 simulations per step instead of 2.</p>\n<h2>Going Further</h2>\n<p>This post is a basic introduction to the technique, but a functional one. There are many more advanced variations for real problems in photonic device design. Production systems add:</p>\n<ul>\n<li><strong>3D simulations</strong>: real devices have finite thickness and vertical confinement</li>\n<li><strong>Broadband optimization</strong>: performance across a range of wavelengths, not just one</li>\n<li><strong>Fabrication constraints</strong>: minimum feature sizes, curvature limits, etch profiles</li>\n<li><strong>Multi-objective</strong>: multiple output ports, polarizations, robustness to manufacturing variation</li>\n</ul>\n<p>If you're interested in going deeper, check out these resources:</p>\n<ul>\n<li><a href=\"https://docs.flexcompute.com/projects/tidy3d/en/latest/notebooks/docs/features/autograd.html\">Inverse design examples</a> (wavelength demultiplexers, metalenses, mode converters, and more)</li>\n<li><a href=\"https://www.flexcompute.com/tidy3d/examples/notebooks/Autograd0Quickstart/\">Inverse design quickstart notebook</a> (a more complete worked example using the same autograd workflow)</li>\n<li><a href=\"https://www.flexcompute.com/tidy3d/learning-center/inverse-design/\">Inverse design learning center</a> (course-style introduction to adjoint optimization in Tidy3D)</li>\n</ul>\n<p>Want the exact working files for this post? 
<a href=\"https://engineering.flexcompute.com/downloads/photonic-inverse-design/photonic-inverse-design-companion.zip\">Download the companion bundle</a>. It includes the notebook, the Jupytext script export, <code>sim_base.yaml</code>, and the helper script that rebuilds the base simulation. Re-running the optimization requires a <a href=\"https://www.simulation.cloud/\">Tidy3D account</a>.</p>",
      "attachments": [
        {
          "url": "https://engineering.flexcompute.com/articles/photonic-inverse-design-45-lines.md",
          "mime_type": "text/markdown",
          "title": "Designing a Photonic Chip Component with ~45 Lines of Python markdown"
        },
        {
          "url": "https://engineering.flexcompute.com/images/og/photonic-inverse-design-45-lines.png",
          "mime_type": "image/png",
          "title": "Designing a Photonic Chip Component with ~45 Lines of Python social image"
        }
      ],
      "_flexcompute": {
        "kind": "Tutorial",
        "tags": [
          "photonics",
          "inverse-design",
          "optimization",
          "tidy3d"
        ],
        "markdown_url": "https://engineering.flexcompute.com/articles/photonic-inverse-design-45-lines.md"
      }
    },
    {
      "id": "https://engineering.flexcompute.com/articles/what-should-we-work-on-next/",
      "url": "https://engineering.flexcompute.com/articles/what-should-we-work-on-next/",
      "title": "\"What Should We Work On Next?\"",
      "summary": "The story of building an 80,000-line autodiff library almost entirely through AI agents — and the verification infrastructure that made it possible.",
      "image": "https://engineering.flexcompute.com/images/og/what-should-we-work-on-next.png",
      "banner_image": "https://engineering.flexcompute.com/images/og/what-should-we-work-on-next.png",
      "date_published": "2026-02-26T00:00:00.000Z",
      "date_modified": "2026-02-26T00:00:00.000Z",
      "authors": [
        {
          "name": "Yannick Augenstein",
          "path": "/authors/yannick-augenstein/",
          "url": "https://engineering.flexcompute.com/authors/yannick-augenstein/"
        },
        {
          "name": "Frederik Schubert",
          "path": "/authors/frederik-schubert/",
          "url": "https://engineering.flexcompute.com/authors/frederik-schubert/"
        }
      ],
      "tags": [
        "AI Engineering",
        "Autodiff",
        "Verification"
      ],
      "content_html": "<p>import HarnessWorkflowFigure from '../../components/HarnessWorkflowFigure.astro';\nimport MutationPipelineFigure from '../../components/MutationPipelineFigure.astro';</p>\n<p>I have been building with AI coding agents for over a year.<sup><a href=\"#user-content-fn-1\" id=\"user-content-fnref-1\" data-footnote-ref aria-describedby=\"footnote-label\">1</a></sup> Most of that time was unstructured: I used whatever worked, fixed problems as they came up, and did not pay much attention to the patterns. Then, starting around November 2025, I spent three months building a mostly personal internal autodiff library: graph-based, eagerly traced, NumPy-integrated, and almost entirely agent-written. By the end, it had reached roughly 80,000 lines of Python.</p>\n<p>I wanted a graph-first, NumPy-native library where the graph itself was explicit, inspectable, and available for transformations beyond autodiff. JAX is excellent, but it optimizes for a different set of tradeoffs: staged execution, JIT compilation, and the accelerator stack.</p>\n<p>Over those three months, my role shifted from reviewing every line of agent output to opening many sessions with a single question: \"what should we work on next?\" That shift did not happen only because the agents improved. It happened because the verification infrastructure improved enough that I could trust the floor: a minimum quality bar would hold regardless of what the agent produced.</p>\n<p>This article is about that infrastructure — what it looks like, how it grew, and why it matters more than you might think. But the infrastructure did not build itself. Every check exists because a human saw a failure and decided to formalize the fix. The code is not open source, so the point of this article is not the artifact itself, but the verification patterns and harness design that emerged while building it.</p>\n<h2>Directed development</h2>\n<p>This library did not start with agents. 
I had been sketching it on the side for a while — a design document, a basic tracing engine, some scaffolding. By the time I opened the first AI session, the repo had about a thousand lines of Python and a clear picture of what the library should be. What it did not have was momentum. A side project I picked up between other work, never for long enough to get past the foundation.</p>\n<p>The first session was a comparison: how did the library stack up against autograd and MyGrad?<sup><a href=\"#user-content-fn-2\" id=\"user-content-fnref-2\" data-footnote-ref aria-describedby=\"footnote-label\">2</a></sup> I was using the agent as a reviewer, asking for architectural opinions. When the agent got the target API wrong — proposing explicit tracing contexts when the library was designed for implicit tracing — I corrected it. The agent was a consultant. I was the authority.</p>\n<p>The second session set up CI: ruff with all rules enabled, mypy strict, <a href=\"https://hypothesis.readthedocs.io/en/latest/\">Hypothesis</a> property-based testing, conventional commits, and AGENTS.md.<sup><a href=\"#user-content-fn-3\" id=\"user-content-fnref-3\" data-footnote-ref aria-describedby=\"footnote-label\">3</a></sup> None of this was about agent harnesses; I did not even know the term yet. Engineering discipline and personal curiosity, which turned out to benefit agents enormously.</p>\n<p>What followed was a burst of rapid development. Over a handful of intense sessions, the codebase grew from a small prototype into something much larger: graph optimization passes, workflow orchestration, xarray integration, debugging and visualization tools. But the derivative engine remained the most verification-intensive part, and the one that drove most of the harness. My workflow settled into a ritual:</p>\n<blockquote>\n<p>\"review the current state of the repo, what do you think would be highest leverage to work on next?\"</p>\n</blockquote>\n<p>Let the agent propose priorities. 
Select or redirect. Ask for a detailed implementation plan with user stories, test strategy, and acceptance criteria. Review it. Paste it back with: \"PLEASE IMPLEMENT THIS PLAN.\"<sup><a href=\"#user-content-fn-4\" id=\"user-content-fnref-4\" data-footnote-ref aria-describedby=\"footnote-label\">4</a></sup></p>\n<h3>Where agents drift</h3>\n<p>The workflow was productive but not self-correcting. Pushing back mattered — and I had to push back constantly.</p>\n<p>On VJP coverage, the agent claimed near-complete parity with autograd. I was skeptical; autograd had far more VJPs than that.<sup><a href=\"#user-content-fn-5\" id=\"user-content-fnref-5\" data-footnote-ref aria-describedby=\"footnote-label\">5</a></sup> I was right — the comparison was incomplete. Compatibility shims revealed a similar pattern: the agent kept proposing backward-compatibility layers for a greenfield project with no users.</p>\n<p>The shims deserve their own mention because they are so universal. Old behavior kept via fallbacks and re-exports, new features built on top, instead of clean breaks. Even with explicit instructions, the agents reached for shims.<sup><a href=\"#user-content-fn-6\" id=\"user-content-fnref-6\" data-footnote-ref aria-describedby=\"footnote-label\">6</a></sup> If you are building anything with coding agents, you will hit this.</p>\n<p>Files bloated too. Core modules grew past 1,800 lines. Test files reached similar sizes. A vicious cycle: longer files fill agent context faster, the agent gets worse at navigating the codebase, the code gets worse, the files grow further. 
The 500-line file limit I eventually imposed came directly from that pain.</p>\n<h3>Pushing back</h3>\n<p>One moment captures why human judgment still matters in this workflow.</p>\n<p>Autodiff frameworks need derivative rules for every operation, typically written as either VJPs (reverse-mode pullbacks) or JVPs (forward-mode pushforwards).<sup><a href=\"#user-content-fn-7\" id=\"user-content-fnref-7\" data-footnote-ref aria-describedby=\"footnote-label\">7</a></sup> This library had accumulated reverse-mode rules first, like most frameworks. In the middle of a session about expanding derivative coverage, the agent proposed deriving JVP rules from the existing VJP rules. That sounded reasonable.</p>\n<p>Then I asked the question that changed the architecture: \"Doesn't it scale better to make JVPs the default source of truth and derive VJPs from them where possible, rather than vice versa?\"</p>\n<p>For this codebase, yes. JVPs were the cleaner authoring primitive for much of the covered NumPy surface, and reverse-mode pullbacks could often be synthesized through the transpose machinery. Some operations still needed explicit reverse-mode exceptions for correctness or performance, but the default direction was backwards from what the agent proposed. Roughly twenty commits in a single day shifted the covered NumPy surface to a JVP-first policy.</p>\n<p>Later that evening, I checked the agent's work again: if JVPs were now the default source of truth, why did the rule layout still look overwhelmingly reverse-mode? The file structure told a different story than the runtime claims. The agent had wired the runtime correctly, but too much of the old reverse-mode formula structure was still in place. 
It took another full session — running overnight, largely autonomously — to make the migration real in the authored rule layout, not just the plumbing.</p>\n<p>A well-defined mathematical problem, clear correctness criteria, and a human catching both the architectural direction and the incomplete execution. If I hadn't asked, I probably wouldn't have caught it until the derivative layer was deeply entrenched — and unwinding it would have been painful.</p>\n<h2>The proto-harness</h2>\n<p>The JVP migration went (reasonably) well because the problem was mathematically well-defined. Most problems aren't. Every piece of verification infrastructure that followed was a response to something that went wrong.</p>\n<p>The first tool was <code>pre_pr.sh</code> — a small shell script born from watching the agent skip steps with each PR. It was the \"<a href=\"https://engineering.flexcompute.com/articles/agent-control-loop#unenforced-verification\">unenforced verification</a>\" failure mode from the first article, playing out in real time. The agent would forget to run mypy, or skip the VJP coverage check, or not rebuild docs. The script was my first attempt at \"one command to verify everything\": check for a clean working tree, rebase on main, run the linter, type checker, tests, coverage checks, grad contracts, and docs build, in sequence. 
If any step failed, the whole thing failed.</p>\n<p>The shell script could not keep up, so I replaced it with <code>quality.py</code> — a Python CLI consolidating the scripts into a single framework.</p>\n<p>But the commands to run verification differed across AGENTS.md, CI workflow files, and docs.<sup><a href=\"#user-content-fn-8\" id=\"user-content-fnref-8\" data-footnote-ref aria-describedby=\"footnote-label\">8</a></sup> And nothing was scoped: every check ran against the full codebase regardless of what changed.</p>\n<p>Each tool was a response to pain.</p>\n<h3>Boundaries</h3>\n<p>The architecture split into layers: a backend-agnostic <code>core</code>, a <code>numpy</code> integration layer, and a <code>grad</code> package for autodiff. The intent was clean separation — core should have zero knowledge of NumPy, so adding a CuPy or JAX backend later would be more tractable. It also gave agents clear lanes to work in.</p>\n<p>On the day this layered architecture was declared, the boundary checker caught its first violation within hours. Then another. The architecture was established, violated, fixed, violated again, and fixed again. The original checker only caught cross-layer internal-package imports; it missed bare <code>import numpy</code> entirely, so I had to extend it, again and again.</p>\n<p>The boundary checker ran continuously, but it only caught what it knew to look for. Ten weeks later, I manually scanned what the agent had built in the gradient package and found dozens of files with <code>import numpy as np</code>, well over a thousand <code>np.*</code> callsites, and a package that would not run without numpy installed even though its <code>pyproject.toml</code> declared no numpy dependency. After a big-bang refactor, the agent introduced a backend-agnostic proxy module — but exported the proxy object as <code>np</code>, not <code>xp</code>. Those files then did <code>from ..._backend_runtime import np</code>. 
Syntactically different from <code>import numpy as np</code>. Visually identical.</p>\n<p>Each round taught me to probe deeper. Was there <em>any</em> code in core that imported numpy, even lazily? Any mention of numpy in core, even as a variable name or string literal? There were — hardcoded <code>\"numpy\"</code> string literals, prefix-coupled registration, multiple compatibility wrapper modules. Every question I learned to ask was a check I should have automated earlier.</p>\n<h3>Curating context</h3>\n<p>Boundary enforcement was about rules within sessions. The next problem was continuity between them.</p>\n<p>I tried to scale development with sub-agents. In Claude Code, I built seven specialized agents: dispatch, test-gen, quality, architect, numpy-protocol, docs, debug. An agent team, each specialized on a domain. It did not stick. I burned a lot of tokens, but the output quality was arguably worse than if I had stuck with a single agent and manual steering. I think the problem was context: each agent starts from only a prompt and has to derive all its context from there. Handoff documents either included too little — and the receiving agent made wrong assumptions — or too much — and the agent could not distinguish signal from background.</p>\n<p>What did work was something simpler. Near the end of a long session, I was about to ask the agent to continue with the next phase. Instead, I asked it to write the prompt I should use for that next phase.</p>\n<p>The agent generated a detailed handoff prompt — project state, remaining work, constraints, validation commands — that I would then curate and paste into a fresh session. 
It was a \"relay\" pattern, where the prompt is a compressed representation of what matters: what was just done, what remains, what constraints apply.<sup><a href=\"#user-content-fn-9\" id=\"user-content-fnref-9\" data-footnote-ref aria-describedby=\"footnote-label\">9</a></sup></p>\n<p>Within days, I was running multiple agents in parallel. Each session got its own git worktree.<sup><a href=\"#user-content-fn-10\" id=\"user-content-fnref-10\" data-footnote-ref aria-describedby=\"footnote-label\">10</a></sup> I broke tasks down into work streams and dispatched them: \"Give me a prompt for each of these work streams, and tell me which ones I can kick off in parallel.\" Without <code>quality.py</code> running in each worktree, parallel agents would have been utter chaos. But now, each agent could independently verify its own changes, enabling a workflow that would not have been sustainable otherwise.</p>\n<h2>Properly, this time</h2>\n<p>Still, the quality gate turned red.<sup><a href=\"#user-content-fn-11\" id=\"user-content-fnref-11\" data-footnote-ref aria-describedby=\"footnote-label\">11</a></sup> The <code>jvp_grad_runtime_ratio</code> had drifted above the 10.0x threshold because of measurement noise on higher-order workloads. It was effectively blocking merges unrelated to performance.</p>\n<p>Around the same time, \"harness engineering\" was becoming a more explicit frame for what many teams were converging on. OpenAI had published <a href=\"https://openai.com/index/harness-engineering/\">their article</a> on the topic, and others were describing similar patterns.<sup><a href=\"#user-content-fn-12\" id=\"user-content-fnref-12\" data-footnote-ref aria-describedby=\"footnote-label\">12</a></sup> The general idea is simple: agent reliability comes from the environment, not just the model, and verification infrastructure deserves primary investment. Seeing the pattern named made the next step clear.</p>\n<p>I took a day off — no sessions, no commits. 
Then I decided to do it properly.</p>\n<p>The next session started with a single instruction: review the repo in light of OpenAI's harness engineering article and suggest how it should be restructured. That kicked off a hard cutover. <code>pre_pr.sh</code> and <code>quality.py</code> gave way to a dedicated harness, diff-scoped mutation testing, and JSON output with bounded content for agent-friendly context windows. That was the harness turning point.</p>\n<h3>Loop, mutate, gate</h3>\n<p>The harness is a CLI that wraps the repo's verification into three progressively broader commands:</p>\n<pre><code class=\"language-bash\">uv run python scripts/harness.py loop          # fast scoped, 180s budget\nuv run python scripts/harness.py mutate        # diff-scoped mutation\nuv run python scripts/harness.py gate          # full blocking merge gate\n</code></pre>\n<p>The agent no longer needs to know what verification steps exist or where they live. It runs a command and gets a go or no-go result.</p>\n<p><code>loop</code> is the tight inner cycle. It figures out what changed, expands through a dependency graph to find affected packages, and runs only the relevant checks — lint, type-checking, tests, quality gates — under a 180-second budget. If it runs out of time, it defers remaining checks rather than failing. In practice, this changed the agent's behavior: instead of running the full test suite after every edit — or worse, running nothing until the end — the agent started running <code>loop</code> after each logical change, catching issues while the context was still fresh.</p>\n<p><code>mutate</code> answers a different question: do the tests actually verify the changed behavior, or do they just execute it?</p>\n<p><code>gate</code> is the full merge requirement. All checks, no scoping, no budget. Everything must pass.</p>\n<p>At the time of writing, the harness runs dozens of checks across those three commands. 
Some of the most useful exist because of specific agent behaviors. The suppression guard blocks new <code># type: ignore</code> and <code># noqa</code> annotations — without it, the agent's first instinct when a type check fails is to suppress the error rather than fix it. The import boundary checker enforces architectural layering with AST-level analysis.</p>\n<p>Every command returns a JSON envelope with bounded content — at most 20 checks, 8 details per check, 240 characters per detail — so the agent's context window is not flooded with log output. A typical response looked like this (simplified):</p>\n<pre><code class=\"language-json\">{\n  \"ok\": false,\n  \"command\": \"harness loop\",\n  \"result\": {\n    \"scope\": { \"expanded_scopes\": [\"core\", \"grad\", \"numpy\", \"...\"] },\n    \"checks\": [\n      {\n        \"id\": \"pytest_scoped\",\n        \"status\": \"fail\",\n        \"details\": [\"FAILED test_tracer.py::test_record - AssertionError\"]\n      }\n    ],\n    \"summary\": { \"total\": 5, \"passed\": 4, \"failed\": 1 }\n  },\n  \"next_actions\": [{ \"command\": \"harness loop\", \"description\": \"Re-run after fix.\" }]\n}\n</code></pre>\n<p>The <code>next_actions</code> field is context-aware: after a successful <code>loop</code>, the harness suggests <code>mutate</code>; after a failure, it suggests the specific post-fix command.</p>\n<p>None of these techniques are new individually. Scope-aware test selection exists in Bazel, Nx, and plenty of CI tools. 
<a href=\"https://en.wikipedia.org/wiki/Mutation_testing\">Mutation testing</a> has been around for decades.<sup><a href=\"#user-content-fn-13\" id=\"user-content-fnref-13\" data-footnote-ref aria-describedby=\"footnote-label\">13</a></sup> The pieces needed to be wired together in a specific way: JSON output bounded for agent context windows, a progression from fast-and-scoped to slow-and-complete, and every check degrading toward strictness when anything is uncertain.<sup><a href=\"#user-content-fn-14\" id=\"user-content-fnref-14\" data-footnote-ref aria-describedby=\"footnote-label\">14</a></sup></p>\n<h3>What to check</h3>\n<p>The core mechanic that makes the harness practical is scope resolution — turning \"what files changed\" into \"what checks to run.\" It starts with <code>git diff origin/main</code> to get changed paths, matches each against a path map that routes file paths to one of nine package scopes, then expands through a dependency graph using BFS. In simplified form, the config looked like this:</p>\n<pre><code class=\"language-toml\">[scope.path_map]\n\"packages/core/\"            = \"core\"\n\"packages/numpy/\"           = \"numpy\"\n\"packages/grad/\"            = \"grad\"\n\"packages/grad_numpy/\"      = \"grad_numpy\"\n\"packages/xarray/\"          = \"xarray\"\n\n[scope.dependencies]\ncore       = []\ngrad       = [\"core\"]\nnumpy      = [\"core\"]\ngrad_numpy = [\"core\", \"grad\", \"numpy\"]\nxarray     = [\"core\", \"numpy\"]\n</code></pre>\n<p>Change a file in <code>packages/core</code> and BFS expands to all dependent packages. Change only <code>packages/xarray</code> and only xarray checks run. Anything unrecognized — unmapped paths, missing merge-base, diff failures — falls back to the full gate.</p>\n<p>This is what enables the 180-second loop budget. Without scoped checks, every change triggers every check — too slow for a tight agent loop. 
With scoping, a change to one leaf package triggers seconds of verification, not minutes.</p>\n<h3>No survivors</h3>\n<p>Coverage measures execution. <a href=\"https://en.wikipedia.org/wiki/Mutation_testing\">Mutation testing</a> measures verification.</p>\n<p>An agent can (and will) write a test like this:</p>\n<pre><code class=\"language-python\">def test_gradient_scaling():\n    result = scale_gradient(x, factor=0.5)\n    assert result is not None  # 100% coverage, 0% verification\n</code></pre>\n<p>That test covers every line of <code>scale_gradient</code>. But flip <code>factor > 0</code> to <code>factor >= 0</code> inside the function, and the test still passes. It is not testing the behavior it claims to cover.</p>\n<p>The idea came from a Slack conversation with <a href=\"https://engineering.flexcompute.com/authors/frederik-schubert\">Frederik</a>: mutation testing is expensive on the whole codebase, but what if you scope it to the diff? Mutating only the five to twenty lines you just changed makes the cost manageable. The pipeline: find changed source lines via <code>git diff</code>, filter to lines actually executed by tests via coverage, generate AST-level mutations (comparison flips, boolean inversions, arithmetic swaps, constant flips), select at most twelve mutants with breadth across files, and run the relevant tests against each. If the tests still pass after a mutation, the mutant survived — the tests do not verify the behavior they claim to cover.</p>\n<p>The policy is strict: 100% kill rate on changed lines. If any mutant survives, the PR is blocked. And <code>require_changed_tests = true</code> adds another constraint: if you change runtime code, you must also change or add tests. No silent runtime changes. The mutation check then verifies that those tests actually catch real behavioral differences, not just execute code paths.</p>\n<p>Pure refactors and equivalent mutants occasionally cause friction. 
The tradeoff is worth it when the entity writing your tests is an AI that optimizes for making them pass rather than making them meaningful. Diff-scoping is what makes it practical. Full-codebase mutation testing is a research project. Diff-scoped mutation testing is a CI check.</p>\n<h2>The new normal</h2>\n<p>What changed after the harness is not that the agent writes better code. What changed is that bad code gets caught — missing tests, broken boundaries, skipped steps, silent regressions — before it compounds.</p>\n<p>Across 287 commits from November 27, 2025 to February 24, 2026, with February 16 as the harness cutover, the share of commits touching test files rose from about 41% to 76%, consistent with <code>require_changed_tests</code>.<sup><a href=\"#user-content-fn-15\" id=\"user-content-fnref-15\" data-footnote-ref aria-describedby=\"footnote-label\">15</a></sup> The fix ratio held roughly flat at about 11%. Commits also got larger — averaging about 1,187 insertions versus 668 pre-harness — suggesting more confidence in landing bigger changes when the harness catches mistakes.</p>\n<p>But the more interesting change was behavioral. I had always asked the agent to propose priorities — that ritual started early. But before the harness, I treated the proposals as suggestions and drove each session with specific goals: \"implement this JVP,\" \"fix the module layout,\" \"split these files.\" After, I could actually follow the agent's lead. Most sessions started the same way — \"what should we work on next?\" — and this time I meant it. The default shifted from \"I tell you what to build\" to \"you tell me what needs building, and I decide whether to approve.\"</p>\n<p>That shift felt strange the first few times. Asking an AI \"what should we work on?\" and actually trusting the answer requires believing the floor will catch whatever goes wrong. The harness was that floor. 
It could not prevent the agent from writing mediocre abstractions or introducing unnecessary complexity — architectural judgment still requires a human. But it could enforce that tests existed, that they verified behavior, that imports respected boundaries, that type annotations were real<sup><a href=\"#user-content-fn-16\" id=\"user-content-fnref-16\" data-footnote-ref aria-describedby=\"footnote-label\">16</a></sup>, that suppressions were not growing. The minimum quality was no longer me. It was the tooling.</p>\n<p>The relay pattern from the earlier sprints would not have scaled without it. Parallel agents exacerbate every problem a single agent has — more drift, more skipped steps, more context confusion — and the harness was what kept them honest.</p>\n<p>When something went wrong, the response changed too. Finding a bug in <code>np.pad</code> tracing, I did not just ask for a fix. I also asked why it slipped through, what would have prevented it, and how the instructions or harness should change so the same class of bug would not recur. Every bug became a harness improvement opportunity. After feature sessions, I started asking: was the harness useful? what did it catch? did mutation tests surface anything? They almost always had — mutation testing reliably catches weak tests that the agent wrote to pass, not to verify. The system learned from its failures, not through any kind of machine learning, but through a human treating each failure as evidence that a check was missing.<sup><a href=\"#user-content-fn-17\" id=\"user-content-fnref-17\" data-footnote-ref aria-describedby=\"footnote-label\">17</a></sup></p>\n<h2>What transfers</h2>\n<p>This implementation is purely Python, and that shows in the choice of technologies. AST-based mutation testing, pytest-driven coverage, ruff and mypy as lint and type gates — these are ecosystem-specific. The implementation is shaped by its context. 
But the exercise transfers: figure out what your agent keeps getting wrong, build a check for it, and wire that check into a loop the agent cannot skip. The specific checks will differ. The discipline should not.</p>\n<p>Start with the check that would have caught the last thing that went wrong. The harness will grow from there.</p>\n<p>I built this library intentionally as a learning experience for myself — paying attention to the patterns, documenting what went wrong, formalizing each fix. I suspect much of it will be familiar to anyone who has spent real time building with agents. The stacks and checks will vary. The habit should not: when the agent fails, turn the failure into a constraint, a check, or a better handoff, then keep going.</p>\n<section data-footnotes class=\"footnotes\"><h2 class=\"sr-only\" id=\"footnote-label\">Footnotes</h2>\n<ol>\n<li id=\"user-content-fn-1\">\n<p>This article was written with AI assistance. I dictated raw thoughts using dictation software, then worked with Claude to turn them into prose — iterating paragraph by paragraph and pushing back whenever something did not sound like me. The research was AI-assisted too: I used a local transcript indexing and search tool, and had agents crawl through three months of git history to surface the timelines and stats behind the claims here. The result is more thoroughly researched than what I would have produced on my own, which is part of the point. <a href=\"#user-content-fnref-1\" data-footnote-backref=\"\" aria-label=\"Back to reference 1\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-2\">\n<p><a href=\"https://github.com/HIPS/autograd\">Autograd</a> by Maclaurin, Duvenaud, and Johnson is the original NumPy autodiff library — elegant, influential, and the reason most of us know reverse-mode AD can feel native to Python. I had used it extensively, and it was the main reference point in my head. 
<a href=\"https://github.com/rsokl/MyGrad\">MyGrad</a> by Ryan Soklaski takes a different approach — a Tensor object with NumPy ufunc/function overrides rather than autograd's tracing tape — and its Hypothesis-heavy testing style directly influenced this library's. <a href=\"#user-content-fnref-2\" data-footnote-backref=\"\" aria-label=\"Back to reference 2\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-3\">\n<p>Autodiff is the perfect application for property-based testing — you can express real mathematical invariants (gradient correctness, chain rule composition, forward/reverse agreement) and let the framework try to break them. <a href=\"#user-content-fnref-3\" data-footnote-backref=\"\" aria-label=\"Back to reference 3\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-4\">\n<p>The plans were remarkably specific. A typical one would include exact file paths, function signatures, test case names, and threshold constants — for example, <code>AUTO_MIN_NODES_FOR_ANY_OPT = 128</code>, <code>AUTO_MIN_NODES_FOR_CSE = 512</code>. The agent produced these from its analysis of the codebase; I reviewed and approved. <a href=\"#user-content-fnref-4\" data-footnote-backref=\"\" aria-label=\"Back to reference 4\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-5\">\n<p>Autograd was the reference point I had in mind when I was sanity-checking the agent's parity claim. <a href=\"#user-content-fnref-5\" data-footnote-backref=\"\" aria-label=\"Back to reference 5\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-6\">\n<p>My best guess at the cause: post-training fine-tuning rewards \"safe\" code — strong compatibility guarantees, no breaking changes, defensive patterns. The training data is overwhelmingly production code with real users, where preserving backward compatibility is the right default. 
The model has no way to distinguish that context from a greenfield repo with zero consumers. Sensible instinct, wrong situation. <a href=\"#user-content-fnref-6\" data-footnote-backref=\"\" aria-label=\"Back to reference 6\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-7\">\n<p>For readers who want to dive deeper: JAX's <a href=\"https://docs.jax.dev/en/latest/notebooks/autodiff_cookbook.html\">Autodiff Cookbook</a> is an excellent introduction to forward- and reverse-mode differentiation, and their <a href=\"https://docs.jax.dev/en/latest/notebooks/Custom_derivative_rules_for_Python_code.html\">Custom derivative rules</a> page explains exactly the JVP-first approach adopted here — define a JVP rule, and the framework derives VJPs automatically by transposing the linear computation. <a href=\"#user-content-fnref-7\" data-footnote-backref=\"\" aria-label=\"Back to reference 7\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-8\">\n<p>The naive question is: why not just keep them in sync? In practice, a growing codebase accumulates multiple places that encode the same information — agent instructions, CI configs, contributor docs, READMEs — and no matter how explicit the instructions are, they drift. Each source gets updated in its own context, by a different session or a different agent, and nobody notices the divergence until something breaks. This is hard to stay on top of even with human contributors; with agents that read whatever file they find first, it compounds fast. The harness solved this by making the single CLI the only source of truth — AGENTS.md says \"run harness loop,\" CI runs \"harness gate,\" and neither needs to enumerate individual steps. 
<a href=\"#user-content-fnref-8\" data-footnote-backref=\"\" aria-label=\"Back to reference 8\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-9\">\n<p>Both Claude Code and Codex CLI had automatic context compaction by this point — summarizing conversation history when the context window fills up. The relay pattern solves a different problem. Auto-compaction tries to preserve everything; the relay deliberately discards, starting a fresh session with only what the human judges relevant. The compression is lossy by design — that is the point. <a href=\"#user-content-fnref-9\" data-footnote-backref=\"\" aria-label=\"Back to reference 9\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-10\">\n<p>Around this time, I shifted most implementation work from Claude Code to Codex. Different tools for different strengths — Codex sessions averaged five hours for heavy, instruction-driven implementation; Claude Code sessions averaged under two hours for analysis, architecture review, and focused tasks. <a href=\"#user-content-fnref-10\" data-footnote-backref=\"\" aria-label=\"Back to reference 10\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-11\">\n<p>This library's graph-based tracing is inherently heavier than autograd's flat tape — you pay for node allocation, edge management, and scope tracking on every operation. Early benchmarks showed very large overhead. A dedicated performance sprint across several parallel workstreams brought this down significantly, and a 10x ratio was set as the acceptable threshold: slow enough to reflect the architectural cost, fast enough to be usable. The threshold was wired into <code>quality.py</code> as a blocking gate — which worked until measurement noise on higher-order workloads pushed it past 10x on runs where nothing performance-related had changed. 
<a href=\"#user-content-fnref-11\" data-footnote-backref=\"\" aria-label=\"Back to reference 11\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-12\">\n<p>The term gained traction in early 2026. Mitchell Hashimoto's <a href=\"https://mitchellh.com/writing/my-ai-adoption-journey\">\"My AI Adoption Journey\"</a> described the practice of engineering a solution for every agent mistake so it never recurs — each line in his AGENTS.md traced to a specific past failure. OpenAI's <a href=\"https://openai.com/index/harness-engineering/\">\"Harness engineering\"</a> made the case at scale using Codex. Anthropic demonstrated it by having <a href=\"https://www.anthropic.com/engineering/building-c-compiler\">sixteen parallel Claude agents build a C compiler</a> in Rust. Cursor showed what happens at the extreme — <a href=\"https://cursor.com/blog/scaling-agents\">agents building a browser from scratch</a>, running unattended for a week. Martin Fowler's <a href=\"https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html\">analysis</a> provided conceptual framing. Can Duruk's <a href=\"https://blog.can.ac/2026/02/12/the-harness-problem/\">\"The Harness Problem\"</a> argued the harness is the bottleneck, not the model. <a href=\"#user-content-fnref-12\" data-footnote-backref=\"\" aria-label=\"Back to reference 12\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-13\">\n<p>Google runs diff-based mutation testing on every code change to its monorepo, using the same core idea: generate mutants only in changed lines, use coverage data to select relevant tests, suppress unproductive mutations. Their system serves tens of thousands of developers. See Petrovic and Ivankovic, <a href=\"https://research.google/pubs/pub46584/\">\"State of Mutation Testing at Google\"</a> (ICSE 2018). 
<a href=\"#user-content-fnref-13\" data-footnote-backref=\"\" aria-label=\"Back to reference 13\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-14\">\n<p>This is the <a href=\"https://engineering.flexcompute.com/articles/agent-control-loop#unenforced-verification\">\"fail closed\" principle</a> from the first article: when the system cannot determine whether something is safe, it should assume it is not. Every ambiguity resolves toward more checking, not less. <a href=\"#user-content-fnref-14\" data-footnote-backref=\"\" aria-label=\"Back to reference 14\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-15\">\n<p>\"Touched test files\" means the commit's diffstat includes at least one file under a <code>tests/</code> directory. \"Fix\" commits are classified by conventional commit prefix (<code>fix:</code>). Insertions are raw <code>git log --stat</code> totals. <a href=\"#user-content-fnref-15\" data-footnote-backref=\"\" aria-label=\"Back to reference 15\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-16\">\n<p>In a typed codebase, agents do not have to guess what a function expects or returns. Types are how agents navigate a large codebase without reading every implementation. In Python that discipline is optional rather than enforced by the language, so tools like mypy strict have to carry the load. <a href=\"#user-content-fnref-16\" data-footnote-backref=\"\" aria-label=\"Back to reference 16\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n<li id=\"user-content-fn-17\">\n<p>This extended to the agent instructions themselves. Early on, I established a rule that <code>AGENTS.md</code> should self-update: if an agent followed a rule and it still led to the wrong outcome, the rule needed to be refined. The instructions file became a living document maintained by the agents who consumed it. 
<a href=\"#user-content-fnref-17\" data-footnote-backref=\"\" aria-label=\"Back to reference 17\" class=\"data-footnote-backref\">↩</a></p>\n</li>\n</ol>\n</section>",
      "attachments": [
        {
          "url": "https://engineering.flexcompute.com/articles/what-should-we-work-on-next.md",
          "mime_type": "text/markdown",
          "title": "\"What Should We Work On Next?\" markdown"
        },
        {
          "url": "https://engineering.flexcompute.com/images/og/what-should-we-work-on-next.png",
          "mime_type": "image/png",
          "title": "\"What Should We Work On Next?\" social image"
        }
      ],
      "_flexcompute": {
        "kind": "Case Study",
        "tags": [
          "ai-engineering",
          "autodiff",
          "verification"
        ],
        "series": "AI Engineering",
        "series_order": 2,
        "markdown_url": "https://engineering.flexcompute.com/articles/what-should-we-work-on-next.md"
      }
    },
    {
      "id": "https://engineering.flexcompute.com/articles/agent-control-loop/",
      "url": "https://engineering.flexcompute.com/articles/agent-control-loop/",
      "title": "The Agent Control Loop — Engineering for Tolerance",
      "summary": "Why agent reliability isn't magic model behavior — it's an environment where correctness is continuously verified. A framework for deciding when and how to delegate to AI agents.",
      "image": "https://engineering.flexcompute.com/images/og/agent-control-loop.png",
      "banner_image": "https://engineering.flexcompute.com/images/og/agent-control-loop.png",
      "date_published": "2026-01-19T00:00:00.000Z",
      "date_modified": "2026-01-19T00:00:00.000Z",
      "authors": [
        {
          "name": "Frederik Schubert",
          "path": "/authors/frederik-schubert/",
          "url": "https://engineering.flexcompute.com/authors/frederik-schubert/"
        },
        {
          "name": "Yannick Augenstein",
          "path": "/authors/yannick-augenstein/",
          "url": "https://engineering.flexcompute.com/authors/yannick-augenstein/"
        }
      ],
      "tags": [
        "AI Engineering",
        "AI Agents",
        "Verification"
      ],
      "content_html": "<p>Consider two recent experiments with coding agents. Similar ambition. Opposite outcomes.</p>\n<p>In the <a href=\"https://cursor.com/blog/scaling-agents\">first</a>, a team pointed hundreds of agents at a browser project. In a week they produced roughly a million lines of code. By coordination metrics it was a success: parallel work, lots of merged PRs, visible throughput. But when the project went public, <a href=\"https://embedding-shapes.github.io/cursor-implied-success-without-evidence/\">outside observers pointed to failing CI and questioned how much of that visible throughput translated into a clean, working system</a>.</p>\n<p>In the second, as described in <a href=\"https://approachwithalacrity.com/p/claude-is-not-a-senior-engineer-yet\"><em>Claude is not a senior engineer (yet)</em></a>, a single engineer connected Claude to an automated browser testing suite (Playwright) and an error monitoring tool (Sentry). The agent wrote code, ran the tests, read the error traces, and fixed its own bugs. Ninety minutes later, it worked.</p>\n<p>I am not trying to offer a definitive postmortem on either case. I am using them as contrasting examples of a broader engineering pattern.</p>\n<p>The pattern is a standard engineering concept: <strong>tolerance.</strong></p>\n<h2>Tolerance: How much drift can you afford?</h2>\n<p>Mechanical engineering abandoned binary \"works/doesn't work\" thinking decades ago. A bridge doesn't just \"work\". It tolerates a specific load variance under specific conditions. We ask about allowable margin of error, <em>i.e.</em>, the acceptable error band around the ideal.</p>\n<p>Software <em>engineering</em> has tolerances, too.</p>\n<p>Tolerance isn't one number. In software it decomposes into dimensions like correctness, security, latency, cost, reversibility, and blast radius (think error budgets). UI copy may be flexible on exact wording but not on brand tone. 
A refactor may tolerate new implementation details but not behavior changes.</p>\n<p>Some tasks are <strong>high tolerance</strong>: exploratory prototyping, quick internal tools, one-off scripts, early drafts. Drift is acceptable because the goal is discovery and speed.</p>\n<p>Other tasks are <strong>low tolerance</strong>: production infrastructure, security boundaries, customer-facing behavior, billing and permissions. Here \"close enough\" isn't a solution. It's a failure that may not show up immediately, but will surface later as incidents and churn.</p>\n<p>The difference between those two agent experiments was that one treated a low-tolerance problem with a high-tolerance process.</p>\n<p>That mismatch shows up as a control problem.</p>\n<h2>Open Loops vs. Closed Loops</h2>\n<p>In an <strong>open loop</strong>, the agent writes code and opens pull requests, but verification bottlenecks at human review minutes, hours, or days later. The delay between action and verification lets error accumulate. Drift becomes visible only after it's expensive.</p>\n<p>In a <strong>closed loop</strong>, the agent makes a change and immediately runs verification against hard constraints. The loop itself damps error. 
Closed loops require fast, reliable feedback; slow or flaky verification re-opens the loop.</p>\n<h3>The Two Experiments Compared</h3>\n<table>\n<thead>\n<tr>\n<th></th>\n<th>Browser Project (Open Loop)</th>\n<th>LLM + Tests + Traces (Closed Loop)</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td><strong>Feedback signal</strong></td>\n<td>Informational (throughput, mergeability, activity)</td>\n<td>Structural (tests pass, errors resolved)</td>\n</tr>\n<tr>\n<td><strong>Verification timing</strong></td>\n<td>After the fact, by humans</td>\n<td>Every turn, by the agent</td>\n</tr>\n<tr>\n<td><strong>Termination condition</strong></td>\n<td>PR merged</td>\n<td>Constraints satisfied</td>\n</tr>\n<tr>\n<td><strong>Outcome</strong></td>\n<td>Implied success; required human fixes</td>\n<td>Working fix in 90 minutes</td>\n</tr>\n</tbody>\n</table>\n<p>Read this way, the browser project optimized for coordination metrics, while the Claude setup optimized for correctness under feedback. The point is not that many agents are inherently bad. The point is that open loops amplify drift when verification is weak. 
<strong>Agent reliability isn't magic model behavior</strong>; it's an environment where correctness is continuously verified.</p>\n<p>But verification has a precondition: success must be expressible as constraints the agent can actually verify.</p>\n<blockquote>\n<p>For a picture of what the browser experiment <em>could</em> have looked like with human coordination and a tight loop, see <a href=\"https://emsh.cat/one-human-one-agent-one-browser/\">One Human + One Agent = One Browser From Scratch</a>.</p>\n</blockquote>\n<h2>The Ambiguity Gap</h2>\n<p>The fundamental challenge with agents is that <strong>intent</strong> is latent (in your head), while <strong>evidence</strong> is explicit (text documents in the repo).</p>\n<p><strong>Ambiguity is the distance between intent and evidence.</strong></p>\n<p>When you delegate to a human engineer, they bridge that gap with judgment: they ask clarifying questions, infer missing context, notice anomalies, and sanity-check against domain knowledge.</p>\n<p>When you delegate to an AI agent, it can't feel that gap. It needs <strong>measurable constraints</strong> to know whether it has actually crossed from \"plausible output\" to \"correct result.\"</p>\n<p>Without constraints, the agent will still do as instructed. It will just optimize for the easiest proxy it can satisfy: producing output, closing tickets, merging PRs. Not correctness and integration.</p>\n<p>A practical predictor of these outcomes is:</p>\n<p><strong>Can you describe success in terms of constraints the agent can verify?</strong></p>\n<p>If you can, agents compound your effort. If you can't, you're doing exploration, and you should treat the output as exploration.</p>\n<h2>Four Failure Modes (and their fixes)</h2>\n<p>When agents appear unreliable, it's usually a failure of the <em>surrounding system design</em> rather than the model itself. 
We see four common patterns.</p>\n<ol class=\"failure-modes\">\n  <li id=\"undefined-specs\">\n    <p class=\"failure-mode__title\">Undefined Specs</p>\n    <p>\n      You have intent, but no mechanism to verify it. The requirements are unsettled, or the\n      definition of done is just a feeling, like \"make onboarding feel simpler.\"\n    </p>\n    <p class=\"failure-mode__fix\">\n      <strong>Fix:</strong> Don't delegate the decision-making. Use the agent to prototype and\n      explore, but treat the output as raw material that helps you write the spec, <em>not</em> as\n      the final product. If you don't know what done looks like, the agent won't either.\n    </p>\n  </li>\n  <li id=\"hidden-context\">\n    <p class=\"failure-mode__title\">Hidden Context</p>\n    <p>\n      The constraints exist, but they're trapped in a meeting note or a Slack thread. Unlike\n      undefined specs, the spec exists here. It just isn't where the agent can read it. Think of the\n      edge-case permission rule that came up once in discussion but never made it into the repo.\n    </p>\n    <p class=\"failure-mode__fix\">\n      <strong>Fix:</strong> Treat context as code. If a constraint isn't captured in versioned,\n      linkable artifacts (<code>AGENTS.md</code>, RFCs/ADRs, schemas), it doesn't exist for the\n      agent.\n    </p>\n  </li>\n  <li id=\"unenforced-verification\">\n    <p class=\"failure-mode__title\">Unenforced Verification</p>\n    <p>\n      The specs exist and are accessible, but the agent isn't forced to check them. Tests are nice\n      to have. CI failures don't block merges. The system rewards speed or volume over correctness.\n      The result is a workflow where \"it probably works\" is treated as progress.\n    </p>\n    <p class=\"failure-mode__fix\">\n      <strong>Fix:</strong> Verification must be a termination condition. CI gates must fail closed.\n      Pre-commit hooks tighten the loop. 
If the tests don't pass, the agent hasn't finished.\n    </p>\n  </li>\n  <li id=\"inadequate-constraints\">\n    <p class=\"failure-mode__title\">Inadequate Constraints</p>\n    <p>\n      The agent is verifying, but the constraints are too weak or too game-able. Tests pass, yet the\n      system is still wrong: coverage is thin, assertions encode the wrong intent, or non-functional\n      requirements (performance, security, UX) aren't represented. For example, unit tests stay\n      green while latency quietly doubles.\n    </p>\n    <p class=\"failure-mode__fix\">\n      <strong>Fix:</strong> Widen the constraint surface. Add invariants and golden tests for\n      critical flows, static analysis (types, linters), and where it matters, property\n      tests/fuzzing. For production-adjacent changes, pair verification with observability and\n      rollback criteria.\n    </p>\n  </li>\n</ol>\n<h2>Deciding What to Delegate</h2>\n<p>As models get smarter and faster, and context windows expand, the temptation is to throw them at larger, fuzzier problems. But a smarter, faster agent in a fuzzy environment mostly produces the wrong thing faster. It cannot know what it cannot read.</p>\n<p>To decide whether to delegate, ask:</p>\n<ul>\n<li>Can the agent verify success on its own?</li>\n<li>How much drift can you tolerate if it gets the answer slightly wrong?</li>\n</ul>\n<p>That gives four cases:</p>\n<ol>\n<li><strong>Verifiable, high tolerance.</strong> Let it run and spot-check. Examples: generating release notes from merged PRs; drafting meeting notes.</li>\n<li><strong>Verifiable, low tolerance.</strong> Delegate with gates. Examples: fixing a failing unit test; fixing a customer-reported bug.</li>\n<li><strong>Not yet verifiable, high tolerance.</strong> Use the agent for exploration, then extract constraints from what you learn. 
Examples: exploring UI layouts for a new feature; brainstorming marketing copy.</li>\n<li><strong>Not yet verifiable, low tolerance.</strong> Don't delegate the decision yet. First use the agent to produce the artifacts that make the work verifiable. Examples: draft a permission matrix, define invariants, write escalation-path tests, prototype policy-as-code.</li>\n</ol>\n<p>High-level work (architecture, strategy, trade-offs) often starts in the hardest case: low verifiability, low tolerance. Assumptions hide best there. But high-level work is usually decomposable. Break it into constrained subtasks, then delegate those.</p>\n<p>For example, \"defining multi-tenant permissions\" is low tolerance and low verifiability at the start. Don't delegate the decision; delegate the work of making it verifiable: draft a permission matrix and invariants, write tests for escalation paths, prototype a policy-as-code layer. Once those constraints exist, implementation becomes a low-tolerance but verifiable task.</p>\n<h2>The Acceleration of Debt</h2>\n<p>None of this is new engineering wisdom. What changes with agents is the rate at which small omissions compound.</p>\n<p>Humans bridge gaps socially: they ask questions, notice contradictions, remember \"that one incident from last year,\" and hesitate when something feels off. Agents don't get those dampeners. They will happily produce plausible work until the system forces contact with reality.</p>\n<p>That's why what used to be technical debt becomes <em>context failure</em>. If a constraint (decision records, schemas, invariants, style guides, runbooks) isn't captured in the repo, it effectively doesn't exist for an agent. <strong>Treat context as code</strong>, and treat verification as the termination condition, not a suggestion.</p>\n<p>The future workflow isn't exotic. It's the old best practices, made load-bearing by speed. Start with the last thing your agent got wrong. Turn it into a constraint or a check. 
Wire it into the loop, then repeat.</p>",
      "attachments": [
        {
          "url": "https://engineering.flexcompute.com/articles/agent-control-loop.md",
          "mime_type": "text/markdown",
          "title": "The Agent Control Loop — Engineering for Tolerance markdown"
        },
        {
          "url": "https://engineering.flexcompute.com/images/og/agent-control-loop.png",
          "mime_type": "image/png",
          "title": "The Agent Control Loop — Engineering for Tolerance social image"
        }
      ],
      "_flexcompute": {
        "kind": "Essay",
        "tags": [
          "ai-engineering",
          "agents",
          "verification"
        ],
        "series": "AI Engineering",
        "series_order": 1,
        "markdown_url": "https://engineering.flexcompute.com/articles/agent-control-loop.md"
      }
    }
  ]
}