---
title: "Predicting Peak Memory for an Electromagnetic Mode Solver"
description: "How we replaced a heuristic memory estimate with a calibrated model for Tidy3D mode solver workloads, eliminating under-predictions across the calibration set."
canonical_url: "https://engineering.flexcompute.com/articles/mode-solver-memory-calibration/"
markdown_url: "https://engineering.flexcompute.com/articles/mode-solver-memory-calibration.md"
image: "https://engineering.flexcompute.com/images/og/mode-solver-memory-calibration.png"
publish_date: "2026-04-15"
updated_date: "2026-04-15"
kind: "Case Study"
authors:
  - "Momchil Minkov"
tags:
  - "Photonics"
  - "Tidy3D"
  - "Verification"
---

import MemoryScatterFigure from '../../components/MemoryScatterFigure.astro';

Before launching a mode solve, the platform has to choose a machine and a parallelism strategy without yet knowing the solve's true peak memory. That estimate depends less on the sparse matrix size than on what happens around it: fill-in during factorization, whether the problem is real or complex, how much state forked workers can share, and how MPI overhead grows with rank count.

Our original estimate was a heuristic, and it missed in both directions. Some jobs OOMed on the first worker and had to be retried on a larger one; others landed on larger machines than they needed from the start. We replaced it with a calibrated model built from measured runs. On the calibration dataset, it removed under-predictions and pulled small-job estimates closer to reality.

<div class="article-overview">
  <p class="article-overview__eyebrow">At a glance</p>
  <div class="article-overview__grid">
    <section>
      <h3>Goal</h3>
      <p>Predict peak mode-solver memory before launch well enough to choose machines safely.</p>
    </section>
    <section>
      <h3>Method</h3>
      <p>
        Fit an affine model, detect matrix regime from metadata, and model copy-on-write and MPI
        overhead explicitly.
      </p>
    </section>
    <section>
      <h3>Outcome</h3>
      <p>Zero under-predictions on the calibration set, with less excess padding on small jobs.</p>
    </section>
  </div>
</div>

## Why the Old Estimate Broke

A mode solver finds the guided electromagnetic modes of a waveguide by assembling a sparse eigenvalue problem and extracting a small number of eigenpairs with shift-invert iteration. The expensive step is LU factorization of the shifted matrix $A - \sigma I$. Most of the peak memory comes from fill-in during that factorization, not from the original sparse matrix itself.

The scheduler has to estimate that peak before the solve starts, because it needs to choose both a machine and a parallelism plan. Two execution modes make that harder:

- Frequency parallel: a Python `multiprocessing.Pool` of `N` workers, each solving one frequency independently. Total memory is not `N` times a single solve because forked workers share the parent process image via copy-on-write.
- MPI parallel: a distributed direct solver that uses `K` MPI ranks for the LU factorization. Overhead scales sublinearly with rank count.

The old estimate knew about these backends in a coarse way, but it missed several effects that turned out to matter. It did not distinguish lossless from lossy materials, or PML from non-PML boundaries, even though either can double or quadruple the effective matrix size. It also ignored mode count, sequential frequency accumulation, and copy-on-write sharing. In practice, it was optimistic in some regions and overly padded in others.

<MemoryScatterFigure
  id="legacy-scatter"
  variant="legacy"
  captionTitle="Legacy model."
  captionBody="On the calibration dataset, many runs sit above the diagonal, meaning the system would allocate less memory than the solver actually needed."
/>

<div class="article-note article-note--warning">
  <p class="article-note__label">Why the baseline failed</p>
  <p>
    <strong>Many runs sit above the diagonal.</strong> Each such point is a job the scheduler would
    have started on too little memory, forcing a retry on a larger worker.
  </p>
</div>

## What Drives Memory in a Sparse Eigensolver

The shift-invert eigensolver builds a sparse matrix and factors it. Memory is dominated by fill-in: the additional non-zero entries created during LU factorization. For a 2D modal cross-section with $N$ grid points, the matrix structure depends on the physics:

| Regime         | Typical trigger                                 | Effective matrix form                               | Relative weight                                         |
| -------------- | ----------------------------------------------- | --------------------------------------------------- | ------------------------------------------------------- |
| Scalar real    | Lossless materials without PML                  | `2N x 2N` real                                      | Baseline                                                |
| Scalar complex | Lossy materials or PML boundaries               | `2N x 2N` complex, often expanded to `4N x 4N` real | Roughly `4x` the scalar-real non-zeros                  |
| Tensorial      | Fully anisotropic media or off-axis propagation | `4N x 4N` complex                                   | About `5x` the scalar-real memory at the same grid size |

$N$ can range from a few thousand to nearly a million, and the matrix dimension determines fill-in volume. A flat per-point heuristic will be right for one regime and wrong for the others.
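The regime weights in the table translate into a small lookup. A sketch with illustrative multipliers taken from the table (the production model fits these per backend rather than hard-coding them):

```python
# Relative memory weight per matrix regime, mirroring the table above.
# Multipliers are illustrative; the calibrated model fits them per backend.
REGIME_SCALE = {
    "scalar_real": 1.0,     # 2N x 2N real baseline
    "scalar_complex": 4.0,  # roughly 4x the scalar-real non-zeros
    "tensorial": 5.0,       # about 5x scalar-real memory at the same grid
}

def regime_scale(is_tensorial: bool, is_complex: bool) -> float:
    """Pick the size-dependent multiplier from the detected regime."""
    if is_tensorial:
        return REGIME_SCALE["tensorial"]
    if is_complex:
        return REGIME_SCALE["scalar_complex"]
    return REGIME_SCALE["scalar_real"]
```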

## Predicting Matrix Structure From Metadata

Choosing the right memory model before launch means inferring two things from the simulation definition alone: whether the problem is tensorial, and whether the matrix will be complex. Both calls have to be made before any matrix is assembled.

<div class="article-key-grid article-key-grid--duo">
  <section class="article-key-card">
    <p class="article-key-card__eyebrow">Tensorial detection</p>
    <h3>Angle is not enough</h3>
    <p>
      Treat the problem as tensorial for fully anisotropic media, or for off-axis propagation when
      <code>angle_rotation</code> is not active.
    </p>
    <p>
      But <code>angle_rotation</code> can rotate the geometry back into an effectively scalar case,
      so angle alone can over-predict.
    </p>
  </section>
  <section class="article-key-card">
    <p class="article-key-card__eyebrow">Complex detection</p>
    <h3>PML counts too</h3>
    <p>
      Treat the matrix as complex if any material has imaginary permittivity, or if any boundary
      uses PML.
    </p>
    <p>
      PML introduces complex coordinate stretching even when every material is real, so missing it
      under-predicts.
    </p>
  </section>
</div>

This logic has to err on the side of safety. A conservative call wastes some capacity; an optimistic one can trigger an OOM and a retry. We checked the metadata-time prediction against the matrix structure the solver actually built at runtime across the full test suite, and treated any under-prediction as a bug.
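A minimal sketch of that metadata-time inference. The argument shapes here — `materials` as `(eps_real, eps_imag)` pairs, `boundaries` as strings — are hypothetical stand-ins for the real simulation definition, and the anisotropy check is omitted:

```python
def detect_regime(materials, boundaries, angle_theta, angle_rotation_active):
    """Infer (is_tensorial, is_complex) from simulation metadata alone.

    `materials`: list of (eps_real, eps_imag) pairs (illustrative shape).
    `boundaries`: list of boundary-kind strings, e.g. "pml" or "pec".
    """
    # Complex if any material is lossy, or any boundary uses PML: PML adds
    # complex coordinate stretching even when every material is real.
    is_complex = any(eps_imag != 0.0 for _, eps_imag in materials) or any(
        kind == "pml" for kind in boundaries
    )
    # Tensorial for off-axis propagation, unless angle_rotation folds the
    # angle back into an effectively scalar problem. (The fully anisotropic
    # media check is omitted from this sketch.)
    is_tensorial = angle_theta != 0.0 and not angle_rotation_active
    return is_tensorial, is_complex
```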

## The Calibrated Model

### Calibration Harness

We built a harness to run hundreds of controlled solves while sweeping:

<div class="article-key-grid article-stat-grid">
  <section class="article-key-card">
    <p class="article-stat-grid__value">7k to 924k</p>
    <p class="article-stat-grid__label">Grid points</p>
  </section>
  <section class="article-key-card">
    <p class="article-stat-grid__value">1 to 10</p>
    <p class="article-stat-grid__label">Modes</p>
  </section>
  <section class="article-key-card">
    <p class="article-stat-grid__value">1 to 5</p>
    <p class="article-stat-grid__label">Frequencies</p>
  </section>
  <section class="article-key-card">
    <p class="article-stat-grid__value">1 to 5</p>
    <p class="article-stat-grid__label">Pool workers</p>
  </section>
  <section class="article-key-card">
    <p class="article-stat-grid__value">
      <code>scipy</code> and <code>MUMPS/PETSc</code>
    </p>
    <p class="article-stat-grid__label">Solver backends</p>
  </section>
  <section class="article-key-card">
    <p class="article-stat-grid__value">Lossless, lossy, PML, tensorial</p>
    <p class="article-stat-grid__label">Material regimes</p>
  </section>
</div>

Each experiment runs in a fresh subprocess so we get a clean resident set size (RSS) measurement for that solve, rather than allocator leftovers from earlier work. We used a separate MPI scaling dataset for larger problems, spanning 366k to 8.9M grid points and 1 to 28 MPI ranks.
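One way to take that clean per-solve measurement — a sketch, assuming a POSIX platform where a `fork` start method is available and `ru_maxrss` is reported in kilobytes (Linux semantics); `solve_fn` stands in for one real solve:

```python
import multiprocessing as mp
import resource

# Use fork explicitly so the child inherits state copy-on-write, matching
# the production pool behavior described above (POSIX only).
_ctx = mp.get_context("fork")

def _measured_solve(solve_fn, args, queue):
    """Run one solve and report this child's own peak RSS."""
    solve_fn(*args)
    # ru_maxrss is reported in kilobytes on Linux.
    queue.put(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

def peak_rss_kb(solve_fn, args=()):
    """Fork a fresh child, run one solve, return its peak RSS in kB.

    A fresh process per measurement avoids allocator leftovers from
    earlier solves polluting the number.
    """
    queue = _ctx.Queue()
    proc = _ctx.Process(target=_measured_solve, args=(solve_fn, args, queue))
    proc.start()
    peak = queue.get()
    proc.join()
    return peak
```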

### Model Structure

The final estimator has four parts:

<div class="article-process">
  <section class="article-process__step">
    <p class="article-process__index">01</p>
    <div>
      <h3>Fit a base affine model</h3>
      <p>
        Start from scalar real cases and fit intercept, per-point slope, and mode-count slope
        separately for each solver backend.
      </p>
    </div>
  </section>
  <section class="article-process__step">
    <p class="article-process__index">02</p>
    <div>
      <h3>Scale only the size-dependent terms</h3>
      <p>
        Apply the regime multiplier to the slopes, not the intercept, so small jobs do not inherit
        exaggerated startup overhead.
      </p>
    </div>
  </section>
  <section class="article-process__step">
    <p class="article-process__index">03</p>
    <div>
      <h3>Account for pool copy-on-write behavior</h3>
      <p>
        Model the first worker separately, then add only the incremental per-worker overhead that
        survives copy-on-write sharing.
      </p>
    </div>
  </section>
  <section class="article-process__step">
    <p class="article-process__index">04</p>
    <div>
      <h3>Add MPI overhead as a power law</h3>
      <p>
        For distributed solves, model extra rank cost separately and use different coefficients for
        real and complex matrices.
      </p>
    </div>
  </section>
</div>

The base model is affine, fitted on scalar real cases:

<p class="article-mini-heading">Base affine fit</p>

```text
single_solve_gb = intercept + pts_coeff * (pts / 1000) + modes_coeff * num_modes
```

We fit separate coefficients for each solver backend. The intercept, roughly 1 GB, captures Python and library startup overhead.
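The per-backend fit is an ordinary least-squares problem. A NumPy sketch whose coefficient names mirror the formula above (illustrative fitting code, not the production harness):

```python
import numpy as np

def fit_base_model(pts, num_modes, peak_gb):
    """Least-squares fit of the base affine model for one backend.

    Returns (intercept, pts_coeff, modes_coeff) such that
    peak_gb ~= intercept + pts_coeff * (pts / 1000) + modes_coeff * num_modes.
    """
    pts = np.asarray(pts, dtype=float)
    num_modes = np.asarray(num_modes, dtype=float)
    # Design matrix: constant term, thousands of grid points, mode count.
    design = np.column_stack([np.ones_like(pts), pts / 1000.0, num_modes])
    coeffs, *_ = np.linalg.lstsq(
        design, np.asarray(peak_gb, dtype=float), rcond=None
    )
    return tuple(coeffs)
```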

Tensorial and complex problems need a regime scale factor on the memory estimate, but it should apply only to the size-dependent terms. The intercept is startup overhead; it does not change with matrix structure.

Scaling the whole estimate inflates predictions by an order of magnitude at small sizes, where the intercept dominates, without buying anything at large sizes, where the slope dominates. Scaling only the growing terms kept small solves tighter while preserving headroom where it mattered.

<p class="article-mini-heading">Regime-aware scaling</p>

```text
scaled_solve = intercept + (pts_coeff * pts / 1000 + modes_coeff * num_modes) * regime_scale
```

For frequency-parallel execution with a pool of `N` workers, total memory accounts for copy-on-write sharing:

<p class="article-mini-heading">Frequency-parallel pool model</p>

```text
first_worker = scaled_solve + accumulation * (freqs_per_worker - 1)
incremental  = min(per_worker_overhead, first_worker)
total        = first_worker + (N - 1) * incremental + safety_buffer
```

The `accumulation` term captures memory growth when one worker handles multiple frequencies sequentially.
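The pool formula translates directly into code. All coefficient values here, including the `safety_buffer` default, are placeholders rather than the tuned production values:

```python
def pool_total_gb(scaled_solve, accumulation, per_worker_overhead,
                  n_workers, freqs_per_worker, safety_buffer=1.0):
    """Frequency-parallel pool estimate, following the formulas above.

    All inputs are in GB; `safety_buffer` defaults to an illustrative
    value, not the tuned one.
    """
    # The first worker pays full price, plus sequential-frequency growth.
    first_worker = scaled_solve + accumulation * (freqs_per_worker - 1)
    # Copy-on-write: an extra worker costs at most a full solve, and
    # usually only the incremental overhead that survives sharing.
    incremental = min(per_worker_overhead, first_worker)
    return first_worker + (n_workers - 1) * incremental + safety_buffer
```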

For MPI-parallel execution with `K` ranks, where frequencies are solved sequentially, the overhead follows a power law:

<p class="article-mini-heading">MPI overhead model</p>

```text
mpi_overhead = (alpha * pts_millions + beta) * (K - 1)^gamma
total        = scaled_solve + mpi_overhead + accumulation * (nf - 1) + safety_buffer
```

There are separate `(alpha, beta, gamma)` coefficients for real versus complex matrices because the distributed solver expands complex matrices to $4N \times 4N$ real, quadrupling fill-in.
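A sketch of the MPI branch, with `coeffs` selected per real/complex regime by the caller; the numeric values in the usage below are placeholders, not fitted coefficients:

```python
def mpi_total_gb(scaled_solve, pts_millions, n_ranks, nf, accumulation,
                 coeffs, safety_buffer=1.0):
    """MPI-parallel estimate, following the formulas above.

    `coeffs` is (alpha, beta, gamma), chosen for the real or complex
    regime; all values are illustrative placeholders.
    """
    alpha, beta, gamma = coeffs
    # Extra rank cost follows a power law in (K - 1); zero for one rank.
    mpi_overhead = (alpha * pts_millions + beta) * (n_ranks - 1) ** gamma
    # Frequencies run sequentially, so accumulation scales with (nf - 1).
    return scaled_solve + mpi_overhead + accumulation * (nf - 1) + safety_buffer
```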

### Allocator Retention in Forked Workers

When Python forks pool workers, the parent's memory is shared copy-on-write, so the cost of adding a worker is much smaller than duplicating a full solve. But the calibration data exposed a second effect: glibc malloc arena accumulation was inflating RSS across sequential solves handled by the same worker. Each frequency solve allocates and frees large sparse arrays, yet the allocator often keeps that memory mapped instead of returning it to the OS. After a few solves, the worker can look materially larger than a fresh process doing the same job. That polluted the calibration data, especially when one worker processed multiple frequencies, and led to scale factors that were too pessimistic.

Setting `maxtasksperchild=1` in the multiprocessing pool fixed it: each worker gets recycled after a single task. Re-running calibration produced cleaner data and tighter coefficients. Single-frequency or single-worker cases skip the pool entirely and run in-process.
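A minimal sketch of that dispatch policy, assuming a POSIX `fork` start method; `solve_frequency` is a placeholder for the real per-frequency solve:

```python
import multiprocessing as mp

def solve_frequency(freq):
    """Placeholder for one single-frequency mode solve."""
    return freq * 2.0

def solve_all(freqs, n_workers):
    """Dispatch frequencies across a pool, recycling each worker after one
    task so allocator arenas never accumulate across sequential solves."""
    if len(freqs) <= 1 or n_workers <= 1:
        # Trivial cases skip the pool entirely and run in-process.
        return [solve_frequency(f) for f in freqs]
    ctx = mp.get_context("fork")  # copy-on-write sharing, POSIX only
    with ctx.Pool(processes=n_workers, maxtasksperchild=1) as pool:
        return pool.map(solve_frequency, freqs)
```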

<div class="article-note">
  <p class="article-note__label">Operational change</p>
  <p>
    <strong>`maxtasksperchild=1` became part of the runtime policy.</strong> Recycle each worker
    after one task so allocator arenas do not accumulate across sequential frequencies.
  </p>
</div>

### Indexing Limits in Sparse Factorization

Calibration also surfaced a failure that looked like a memory problem but wasn't. At large grid counts in the tensorial regime, the solver crashed with a cryptic error even though the machine still had plenty of RAM left.

The cause was a 32-bit integer overflow in the sparse factorization's internal indexing. For a tensorial $4N \times 4N$ matrix at large $N$, fill-in grows roughly as $O(n^{3/2})$ in the matrix dimension $n$ and can exceed the 2.15 billion non-zero limit of 32-bit indices. More RAM doesn't help.

We now validate the grid point count before submission and return a clear error, rather than letting it crash deep inside the factorization.
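A sketch of that pre-submission check. The `fill_coeff` constant and the $n^{3/2}$ fill-in model are illustrative assumptions, not the production validation rule:

```python
INT32_MAX = 2**31 - 1  # largest non-zero count addressable by 32-bit indices

def check_indexing_limit(matrix_dim, fill_coeff=2.0):
    """Fail fast with a clear error instead of deep inside the factorization.

    Fill-in is modeled as roughly fill_coeff * n**1.5 non-zeros for an
    n x n matrix; `fill_coeff` is an illustrative constant, not fitted.
    """
    predicted_nnz = fill_coeff * matrix_dim ** 1.5
    if predicted_nnz > INT32_MAX:
        raise ValueError(
            f"Predicted fill-in ({predicted_nnz:.2e} non-zeros) exceeds the "
            f"32-bit index limit ({INT32_MAX}); reduce the grid size."
        )
```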

<div class="article-note article-note--warning">
  <p class="article-note__label">Hard limit</p>
  <p>
    <strong>This is an indexing ceiling, not a RAM ceiling.</strong> When a solver fails at a
    suspiciously round problem size, check the index types before adding more RAM.
  </p>
</div>

<MemoryScatterFigure
  id="calibrated-scatter"
  variant="calibrated"
  captionTitle="Calibrated model."
  captionBody="After accounting for regime scaling, allocator retention, and indexing limits, every point in the calibration set stays on or below the diagonal."
/>

<div class="article-note article-note--success">
  <p class="article-note__label">Calibration result</p>
  <p>
    <strong>Every point sits below the diagonal.</strong> On this dataset, the estimator is
    conservative everywhere.
  </p>
</div>

## What We Learned

The calibrated model removed the cases where the first attempt would have OOMed. A few lessons from this work carry beyond mode solvers:

<ol class="article-takeaways">
  <li>
    <strong>Separate fixed overhead from growth terms.</strong> If your model includes a constant
    startup cost, scaling the whole expression inflates predictions at small sizes without helping
    at large sizes. Scale only the terms that actually grow with the input.
  </li>
  <li>
    <strong>Measure in subprocesses.</strong> Long-lived workers accumulate allocator artifacts. If
    you want clean peak RSS numbers, fork a child, do the work, and measure before it exits. Watch
    for allocator retention and use <code>maxtasksperchild=1</code> where it matters.
  </li>
  <li>
    <strong>Validate metadata-time predictions against runtime observations.</strong> It is easy to
    write a detection rule that works for the common case but misses edge cases like angle plus
    rotation or PML without lossy materials. Asserting that the frontend prediction never
    under-predicts the backend's actual matrix structure catches these before production.
  </li>
  <li>
    <strong>Not every ceiling is a memory ceiling.</strong> Integer overflow in sparse indexing,
    32-bit pointer limits, or process-count limits can all masquerade as OOM. When a solver fails at
    a suspiciously round problem size, check the index types before adding more RAM.
  </li>
</ol>

Get the memory estimate wrong, and you pay for it in retries, latency, or wasted RAM.
