---
title: "Predicting Peak Memory for an Electromagnetic Mode Solver"
description: "How we replaced a heuristic memory estimate with a calibrated model for Tidy3D mode solver workloads, eliminating under-predictions across the calibration set."
canonical_url: "https://engineering.flexcompute.com/articles/mode-solver-memory-calibration/"
markdown_url: "https://engineering.flexcompute.com/articles/mode-solver-memory-calibration.md"
image: "https://engineering.flexcompute.com/images/og/mode-solver-memory-calibration.png"
publish_date: "2026-04-15"
updated_date: "2026-04-15"
kind: "Case Study"
authors:
  - "Momchil Minkov"
tags:
  - "Photonics"
  - "Tidy3D"
  - "Verification"
---

import MemoryScatterFigure from '../../components/MemoryScatterFigure.astro';

Before launching a mode solve, the platform has to choose a machine and a parallelism strategy without yet knowing the solve's true peak memory. That estimate depends less on the sparse matrix size than on what happens around it: fill-in during factorization, whether the problem is real or complex, how much state forked workers can share, and how MPI overhead grows with rank count.

Our original estimate was a heuristic, and it missed in both directions. Some jobs OOMed on the first worker and had to be retried on a larger one; others landed on larger machines than they needed from the start. We replaced it with a calibrated model built from measured runs. On the calibration dataset, it removed under-predictions and pulled small-job estimates closer to reality.

<div class="article-overview">
  <p class="article-overview__eyebrow">At a glance</p>
  <div class="article-overview__grid">
    <section>
      <h3>Goal</h3>
      <p>Predict peak mode-solver memory before launch well enough to choose machines safely.</p>
    </section>
    <section>
      <h3>Method</h3>
      <p>
        Fit an affine model, detect matrix regime from metadata, and model copy-on-write and MPI
        overhead explicitly.
      </p>
    </section>
    <section>
      <h3>Outcome</h3>
      <p>Zero under-predictions on the calibration set, with less excess padding on small jobs.</p>
    </section>
  </div>
</div>

## Why the Old Estimate Broke

A mode solver finds the guided electromagnetic modes of a waveguide by assembling a sparse eigenvalue problem and extracting a small number of eigenpairs with shift-invert iteration. The expensive step is LU factorization of the shifted matrix $A - \sigma I$. Most of the peak memory comes from fill-in during that factorization, not from the original sparse matrix itself.

The scheduler has to estimate that peak before the solve starts, because it needs to choose both a machine and a parallelism plan. Two execution modes make that harder:

- Frequency parallel: a Python `multiprocessing.Pool` of `N` workers, each solving one frequency independently. Total memory is not `N` times a single solve because forked workers share the parent process image via copy-on-write.
- MPI parallel: a distributed direct solver that uses `K` MPI ranks for the LU factorization. Overhead scales sublinearly with rank count.

The old estimate knew about these backends in a coarse way, but it missed several effects that turned out to matter. It did not distinguish lossless from lossy materials, or PML from non-PML boundaries, even though either can double or quadruple the effective matrix size. It also ignored mode count, sequential frequency accumulation, and copy-on-write sharing. In practice, it was optimistic in some regions and overly padded in others.

<MemoryScatterFigure
  id="legacy-scatter"
  variant="legacy"
  captionTitle="Legacy model."
  captionBody="On the calibration dataset, many runs sit above the diagonal, meaning the system would allocate less memory than the solver actually needed."
/>

<div class="article-note article-note--warning">
  <p class="article-note__label">Why the baseline failed</p>
  <p>
    <strong>Many runs sit above the diagonal.</strong> Each such point is a job the scheduler would
    have started on too little memory, forcing a retry on a larger worker.
  </p>
</div>

## What Drives Memory in a Sparse Eigensolver

The shift-invert eigensolver builds a sparse matrix and factors it. Memory is dominated by fill-in: the additional non-zero entries created during LU factorization. For a 2D modal cross-section with $N$ grid points, the matrix structure depends on the physics:

| Regime         | Typical trigger                                 | Effective matrix form                               | Relative weight                                         |
| -------------- | ----------------------------------------------- | --------------------------------------------------- | ------------------------------------------------------- |
| Scalar real    | Lossless materials without PML                  | `2N x 2N` real                                      | Baseline                                                |
| Scalar complex | Lossy materials or PML boundaries               | `2N x 2N` complex, often expanded to `4N x 4N` real | Roughly `4x` the scalar-real non-zeros                  |
| Tensorial      | Fully anisotropic media or off-axis propagation | `4N x 4N` complex                                   | About `5x` the scalar-real memory at the same grid size |

$N$ can range from a few thousand to nearly a million, and the matrix dimension determines fill-in volume. A flat per-point heuristic will be right for one regime and wrong for the others.
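The regime weights in the table translate into a small lookup. A sketch with illustrative multipliers taken from the table (the production model fits these per backend rather than hard-coding them):

```python
# Relative memory weight per matrix regime, mirroring the table above.
# Multipliers are illustrative; the calibrated model fits them per backend.
REGIME_SCALE = {
    "scalar_real": 1.0,     # 2N x 2N real baseline
    "scalar_complex": 4.0,  # roughly 4x the scalar-real non-zeros
    "tensorial": 5.0,       # about 5x scalar-real memory at the same grid
}

def regime_scale(is_tensorial: bool, is_complex: bool) -> float:
    """Pick the size-dependent multiplier from the detected regime."""
    if is_tensorial:
        return REGIME_SCALE["tensorial"]
    if is_complex:
        return REGIME_SCALE["scalar_complex"]
    return REGIME_SCALE["scalar_real"]
```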

## Predicting Matrix Structure From Metadata

Choosing the right memory model before launch means inferring two things from the simulation definition alone: whether the problem is tensorial, and whether the matrix will be complex. Both calls have to be made before any matrix is assembled.

<div class="article-key-grid article-key-grid--duo">
  <section class="article-key-card">
    <p class="article-key-card__eyebrow">Tensorial detection</p>
    <h3>Angle is not enough</h3>
    <p>
      Treat the problem as tensorial for fully anisotropic media, or for off-axis propagation when
      <code>angle_rotation</code> is not active.
    </p>
    <p>
      But <code>angle_rotation</code> can rotate the geometry back into an effectively scalar case,
      so angle alone can over-predict.
    </p>
  </section>
  <section class="article-key-card">
    <p class="article-key-card__eyebrow">Complex detection</p>
    <h3>PML counts too</h3>
    <p>
      Treat the matrix as complex if any material has imaginary permittivity, or if any boundary
      uses PML.
    </p>
    <p>
      PML introduces complex coordinate stretching even when every material is real, so missing it
      under-predicts.
    </p>
  </section>
</div>

This logic has to err on the side of safety. A conservative call wastes some capacity; an optimistic one can trigger an OOM and a retry. We checked the metadata-time prediction against the matrix structure the solver actually built at runtime across the full test suite, and treated any under-prediction as a bug.
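A minimal sketch of that metadata-time inference. The argument shapes here — `materials` as `(eps_real, eps_imag)` pairs, `boundaries` as strings — are hypothetical stand-ins for the real simulation definition, and the anisotropy check is omitted:

```python
def detect_regime(materials, boundaries, angle_theta, angle_rotation_active):
    """Infer (is_tensorial, is_complex) from simulation metadata alone.

    `materials`: list of (eps_real, eps_imag) pairs (illustrative shape).
    `boundaries`: list of boundary-kind strings, e.g. "pml" or "pec".
    """
    # Complex if any material is lossy, or any boundary uses PML: PML adds
    # complex coordinate stretching even when every material is real.
    is_complex = any(eps_imag != 0.0 for _, eps_imag in materials) or any(
        kind == "pml" for kind in boundaries
    )
    # Tensorial for off-axis propagation, unless angle_rotation folds the
    # angle back into an effectively scalar problem. (The fully anisotropic
    # media check is omitted from this sketch.)
    is_tensorial = angle_theta != 0.0 and not angle_rotation_active
    return is_tensorial, is_complex
```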

## The Calibrated Model

### Calibration Harness

We built a harness to run hundreds of controlled solves while sweeping:

<div class="article-key-grid article-stat-grid">
  <section class="article-key-card">
    <p class="article-stat-grid__value">7k to 924k</p>
    <p class="article-stat-grid__label">Grid points</p>
  </section>
  <section class="article-key-card">
    <p class="article-stat-grid__value">1 to 10</p>
    <p class="article-stat-grid__label">Modes</p>
  </section>
  <section class="article-key-card">
    <p class="article-stat-grid__value">1 to 5</p>
    <p class="article-stat-grid__label">Frequencies</p>
  </section>
  <section class="article-key-card">
    <p class="article-stat-grid__value">1 to 5</p>
    <p class="article-stat-grid__label">Pool workers</p>
  </section>
  <section class="article-key-card">
    <p class="article-stat-grid__value">
      <code>scipy</code> and <code>MUMPS/PETSc</code>
    </p>
    <p class="article-stat-grid__label">Solver backends</p>
  </section>
  <section class="article-key-card">
    <p class="article-stat-grid__value">Lossless, lossy, PML, tensorial</p>
    <p class="article-stat-grid__label">Material regimes</p>
  </section>
</div>

Each experiment runs in a fresh subprocess so we get a clean resident set size (RSS) measurement for that solve, rather than allocator leftovers from earlier work. We used a separate MPI scaling dataset for larger problems, spanning 366k to 8.9M grid points and 1 to 28 MPI ranks.
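One way to take that clean per-solve measurement — a sketch, assuming a POSIX platform where a `fork` start method is available and `ru_maxrss` is reported in kilobytes (Linux semantics); `solve_fn` stands in for one real solve:

```python
import multiprocessing as mp
import resource

# Use fork explicitly so the child inherits state copy-on-write, matching
# the production pool behavior described above (POSIX only).
_ctx = mp.get_context("fork")

def _measured_solve(solve_fn, args, queue):
    """Run one solve and report this child's own peak RSS."""
    solve_fn(*args)
    # ru_maxrss is reported in kilobytes on Linux.
    queue.put(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

def peak_rss_kb(solve_fn, args=()):
    """Fork a fresh child, run one solve, return its peak RSS in kB.

    A fresh process per measurement avoids allocator leftovers from
    earlier solves polluting the number.
    """
    queue = _ctx.Queue()
    proc = _ctx.Process(target=_measured_solve, args=(solve_fn, args, queue))
    proc.start()
    peak = queue.get()
    proc.join()
    return peak
```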

### Model Structure

The final estimator has four parts:

<div class="article-process">
  <section class="article-process__step">
    <p class="article-process__index">01</p>
    <div>
      <h3>Fit a base affine model</h3>
      <p>
        Start from scalar real cases and fit intercept, per-point slope, and mode-count slope
        separately for each solver backend.
      </p>
    </div>
  </section>
  <section class="article-process__step">
    <p class="article-process__index">02</p>
    <div>
      <h3>Scale only the size-dependent terms</h3>
      <p>
        Apply the regime multiplier to the slopes, not the intercept, so small jobs do not inherit
        exaggerated startup overhead.
      </p>
    </div>
  </section>
  <section class="article-process__step">
    <p class="article-process__index">03</p>
    <div>
      <h3>Account for pool copy-on-write behavior</h3>
      <p>
        Model the first worker separately, then add only the incremental per-worker overhead that
        survives copy-on-write sharing.
      </p>
    </div>
  </section>
  <section class="article-process__step">
    <p class="article-process__index">04</p>
    <div>
      <h3>Add MPI overhead as a power law</h3>
      <p>
        For distributed solves, model extra rank cost separately and use different coefficients for
        real and complex matrices.
      </p>
    </div>
  </section>
</div>

The base model is affine, fitted on scalar real cases:

<p class="article-mini-heading">Base affine fit</p>

```text
single_solve_gb = intercept + pts_coeff * (pts / 1000) + modes_coeff * num_modes
```

We fit separate coefficients for each solver backend. The intercept, roughly 1 GB, captures Python and library startup overhead.
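The per-backend fit is an ordinary least-squares problem. A NumPy sketch whose coefficient names mirror the formula above (illustrative fitting code, not the production harness):

```python
import numpy as np

def fit_base_model(pts, num_modes, peak_gb):
    """Least-squares fit of the base affine model for one backend.

    Returns (intercept, pts_coeff, modes_coeff) such that
    peak_gb ~= intercept + pts_coeff * (pts / 1000) + modes_coeff * num_modes.
    """
    pts = np.asarray(pts, dtype=float)
    num_modes = np.asarray(num_modes, dtype=float)
    # Design matrix: constant term, thousands of grid points, mode count.
    design = np.column_stack([np.ones_like(pts), pts / 1000.0, num_modes])
    coeffs, *_ = np.linalg.lstsq(
        design, np.asarray(peak_gb, dtype=float), rcond=None
    )
    return tuple(coeffs)
```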

Tensorial and complex problems need a regime scale factor on the memory estimate, but it should apply only to the size-dependent terms. The intercept is startup overhead; it does not change with matrix structure.

Scaling the whole estimate inflates predictions by an order of magnitude at small sizes, where the intercept dominates, without buying anything at large sizes, where the slope dominates. Scaling only the growing terms kept small solves tighter while preserving headroom where it mattered.

<p class="article-mini-heading">Regime-aware scaling</p>

```text
scaled_solve = intercept + (pts_coeff * pts / 1000 + modes_coeff * num_modes) * regime_scale
```

For frequency-parallel execution with a pool of `N` workers, total memory accounts for copy-on-write sharing:

<p class="article-mini-heading">Frequency-parallel pool model</p>

```text
first_worker = scaled_solve + accumulation * (freqs_per_worker - 1)
incremental  = min(per_worker_overhead, first_worker)
total        = first_worker + (N - 1) * incremental + safety_buffer
```

The `accumulation` term captures memory growth when one worker handles multiple frequencies sequentially.
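The pool formula translates directly into code. All coefficient values here, including the `safety_buffer` default, are placeholders rather than the tuned production values:

```python
def pool_total_gb(scaled_solve, accumulation, per_worker_overhead,
                  n_workers, freqs_per_worker, safety_buffer=1.0):
    """Frequency-parallel pool estimate, following the formulas above.

    All inputs are in GB; `safety_buffer` defaults to an illustrative
    value, not the tuned one.
    """
    # The first worker pays full price, plus sequential-frequency growth.
    first_worker = scaled_solve + accumulation * (freqs_per_worker - 1)
    # Copy-on-write: an extra worker costs at most a full solve, and
    # usually only the incremental overhead that survives sharing.
    incremental = min(per_worker_overhead, first_worker)
    return first_worker + (n_workers - 1) * incremental + safety_buffer
```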

For MPI-parallel execution with `K` ranks, where frequencies are solved sequentially, the overhead follows a power law:

<p class="article-mini-heading">MPI overhead model</p>

```text
mpi_overhead = (alpha * pts_millions + beta) * (K - 1)^gamma
total        = scaled_solve + mpi_overhead + accumulation * (nf - 1) + safety_buffer
```

There are separate `(alpha, beta, gamma)` coefficients for real versus complex matrices because the distributed solver expands complex matrices to $4N \times 4N$ real, quadrupling fill-in.
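A sketch of the MPI branch, with `coeffs` selected per real/complex regime by the caller; the numeric values in the usage below are placeholders, not fitted coefficients:

```python
def mpi_total_gb(scaled_solve, pts_millions, n_ranks, nf, accumulation,
                 coeffs, safety_buffer=1.0):
    """MPI-parallel estimate, following the formulas above.

    `coeffs` is (alpha, beta, gamma), chosen for the real or complex
    regime; all values are illustrative placeholders.
    """
    alpha, beta, gamma = coeffs
    # Extra rank cost follows a power law in (K - 1); zero for one rank.
    mpi_overhead = (alpha * pts_millions + beta) * (n_ranks - 1) ** gamma
    # Frequencies run sequentially, so accumulation scales with (nf - 1).
    return scaled_solve + mpi_overhead + accumulation * (nf - 1) + safety_buffer
```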

### Allocator Retention in Forked Workers

When Python forks pool workers, the parent's memory is shared copy-on-write, so the cost of adding a worker is much smaller than duplicating a full solve. But the calibration data exposed a second effect: glibc malloc arena accumulation was inflating RSS across sequential solves handled by the same worker. Each frequency solve allocates and frees large sparse arrays, yet the allocator often keeps that memory mapped instead of returning it to the OS. After a few solves, the worker can look materially larger than a fresh process doing the same job. That polluted the calibration data, especially when one worker processed multiple frequencies, and led to scale factors that were too pessimistic.

Setting `maxtasksperchild=1` in the multiprocessing pool fixed it: each worker gets recycled after a single task. Re-running calibration produced cleaner data and tighter coefficients. Single-frequency or single-worker cases skip the pool entirely and run in-process.
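A minimal sketch of that dispatch policy, assuming a POSIX `fork` start method; `solve_frequency` is a placeholder for the real per-frequency solve:

```python
import multiprocessing as mp

def solve_frequency(freq):
    """Placeholder for one single-frequency mode solve."""
    return freq * 2.0

def solve_all(freqs, n_workers):
    """Dispatch frequencies across a pool, recycling each worker after one
    task so allocator arenas never accumulate across sequential solves."""
    if len(freqs) <= 1 or n_workers <= 1:
        # Trivial cases skip the pool entirely and run in-process.
        return [solve_frequency(f) for f in freqs]
    ctx = mp.get_context("fork")  # copy-on-write sharing, POSIX only
    with ctx.Pool(processes=n_workers, maxtasksperchild=1) as pool:
        return pool.map(solve_frequency, freqs)
```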

<div class="article-note">
  <p class="article-note__label">Operational change</p>
  <p>
    <strong>`maxtasksperchild=1` became part of the runtime policy.</strong> Recycle each worker
    after one task so allocator arenas do not accumulate across sequential frequencies.
  </p>
</div>

### Indexing Limits in Sparse Factorization

Calibration also surfaced a failure that looked like a memory problem but wasn't. At large grid counts in the tensorial regime, the solver crashed with a cryptic error even though the machine still had plenty of RAM left.

The cause was a 32-bit integer overflow in the sparse factorization's internal indexing. For a tensorial $4N \times 4N$ matrix at large $N$, fill-in grows roughly as $O(n^{3/2})$ in the matrix dimension $n$ and can exceed the 2.15 billion non-zero limit of 32-bit indices. More RAM doesn't help.

We now validate the grid point count before submission and return a clear error, rather than letting it crash deep inside the factorization.
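A sketch of that pre-submission check. The `fill_coeff` constant and the $n^{3/2}$ fill-in model are illustrative assumptions, not the production validation rule:

```python
INT32_MAX = 2**31 - 1  # largest non-zero count addressable by 32-bit indices

def check_indexing_limit(matrix_dim, fill_coeff=2.0):
    """Fail fast with a clear error instead of deep inside the factorization.

    Fill-in is modeled as roughly fill_coeff * n**1.5 non-zeros for an
    n x n matrix; `fill_coeff` is an illustrative constant, not fitted.
    """
    predicted_nnz = fill_coeff * matrix_dim ** 1.5
    if predicted_nnz > INT32_MAX:
        raise ValueError(
            f"Predicted fill-in ({predicted_nnz:.2e} non-zeros) exceeds the "
            f"32-bit index limit ({INT32_MAX}); reduce the grid size."
        )
```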

<div class="article-note article-note--warning">
  <p class="article-note__label">Hard limit</p>
  <p>
    <strong>This is an indexing ceiling, not a RAM ceiling.</strong> When a solver fails at a
    suspiciously round problem size, check the index types before adding more RAM.
  </p>
</div>

<MemoryScatterFigure
  id="calibrated-scatter"
  variant="calibrated"
  captionTitle="Calibrated model."
  captionBody="After accounting for regime scaling, allocator retention, and indexing limits, every point in the calibration set stays on or below the diagonal."
/>

<div class="article-note article-note--success">
  <p class="article-note__label">Calibration result</p>
  <p>
    <strong>Every point sits below the diagonal.</strong> On this dataset, the estimator is
    conservative everywhere.
  </p>
</div>

## What We Learned

The calibrated model removed the cases where the first attempt would have OOMed. A few lessons from this work carry beyond mode solvers:

<ol class="article-takeaways">
  <li>
    <strong>Separate fixed overhead from growth terms.</strong> If your model includes a constant
    startup cost, scaling the whole expression inflates predictions at small sizes without helping
    at large sizes. Scale only the terms that actually grow with the input.
  </li>
  <li>
    <strong>Measure in subprocesses.</strong> Long-lived workers accumulate allocator artifacts. If
    you want clean peak RSS numbers, fork a child, do the work, and measure before it exits. Watch
    for allocator retention and use <code>maxtasksperchild=1</code> where it matters.
  </li>
  <li>
    <strong>Validate metadata-time predictions against runtime observations.</strong> It is easy to
    write a detection rule that works for the common case but misses edge cases like angle plus
    rotation or PML without lossy materials. Asserting that the frontend prediction never
    under-predicts the backend's actual matrix structure catches these before production.
  </li>
  <li>
    <strong>Not every ceiling is a memory ceiling.</strong> Integer overflow in sparse indexing,
    32-bit pointer limits, or process-count limits can all masquerade as OOM. When a solver fails at
    a suspiciously round problem size, check the index types before adding more RAM.
  </li>
</ol>

Get the memory estimate wrong, and you pay for it in retries, latency, or wasted RAM.
