Before launching a mode solve, the platform has to choose a machine and a parallelism strategy without yet knowing the solve’s true peak memory. That estimate depends less on the sparse matrix size than on what happens around it: fill-in during factorization, whether the problem is real or complex, how much state forked workers can share, and how MPI overhead grows with rank count.
Our original estimate was a heuristic, and it missed in both directions. Some jobs OOMed on the first worker and had to be retried on a larger one; others landed on larger machines than they needed from the start. We replaced it with a calibrated model built from measured runs. On the calibration dataset, it removed under-predictions and pulled small-job estimates closer to reality.
At a glance
- Goal: Predict peak mode-solver memory before launch well enough to choose machines safely.
- Method: Fit an affine model, detect matrix regime from metadata, and model copy-on-write and MPI overhead explicitly.
- Outcome: Zero under-predictions on the calibration set, with less excess padding on small jobs.
Why the Old Estimate Broke
A mode solver finds the guided electromagnetic modes of a waveguide by assembling a sparse eigenvalue problem and extracting a small number of eigenpairs with shift-invert iteration. The expensive step is LU factorization of the shifted matrix (A − σI in the standard shift-invert formulation). Most of the peak memory comes from fill-in during that factorization, not from the original sparse matrix itself.
The scheduler has to estimate that peak before the solve starts, because it needs to choose both a machine and a parallelism plan. Two execution modes make that harder:
- Frequency parallel: a Python multiprocessing.Pool of N workers, each solving one frequency independently. Total memory is not N times a single solve, because forked workers share the parent process image via copy-on-write.
- MPI parallel: a distributed direct solver that uses K MPI ranks for the LU factorization. Overhead scales sublinearly with rank count.
The old estimate knew about these backends in a coarse way, but it missed several effects that turned out to matter. It did not distinguish lossless from lossy materials, or PML from non-PML boundaries, even though either can double or quadruple the effective matrix size. It also ignored mode count, sequential frequency accumulation, and copy-on-write sharing. In practice, it was optimistic in some regions and overly padded in others.
Before calibration
Orange points mark the 263 calibration runs the old estimator would have launched with too little memory. The plot stays zoomed to 0–50 GB so the failure pattern remains readable.
Why the baseline failed
Many runs sit above the diagonal. Each such point is a job the scheduler would have started on too little memory, forcing a retry on a larger worker.
What Drives Memory in a Sparse Eigensolver
The shift-invert eigensolver builds a sparse matrix and factors it. Memory is dominated by fill-in: the additional non-zero entries created during LU factorization. For a 2D modal cross-section with N grid points, the matrix structure depends on the physics:
| Regime | Typical trigger | Effective matrix form | Relative weight |
|---|---|---|---|
| Scalar real | Lossless materials without PML | 2N x 2N real | Baseline |
| Scalar complex | Lossy materials or PML boundaries | 2N x 2N complex, often expanded to 4N x 4N real | Roughly 4x the scalar-real non-zeros |
| Tensorial | Fully anisotropic media or off-axis propagation | 4N x 4N complex | About 5x the scalar-real memory at the same grid size |
N can range from a few thousand to nearly a million, and the effective matrix dimension determines fill-in volume. A flat per-point heuristic will be right for one regime and wrong for the others.
Predicting Matrix Structure From Metadata
Choosing the right memory model before launch means inferring two things from the simulation definition alone: whether the problem is tensorial, and whether the matrix will be complex. We have to get both right before assembling anything.
Tensorial detection
Angle is not enough
Treat the problem as tensorial for fully anisotropic media, or for off-axis propagation when angle_rotation is not active. But angle_rotation can rotate the geometry back into an effectively scalar case, so angle alone can over-predict.
Complex detection
PML counts too
Treat the matrix as complex if any material has imaginary permittivity, or if any boundary uses PML.
PML introduces complex coordinate stretching even when every material is real, so missing it under-predicts.
This logic has to err on the side of safety. A conservative call wastes some capacity; an optimistic one can trigger an OOM and a retry. We checked the metadata-time prediction against the matrix structure the solver actually built at runtime across the full test suite, and treated any under-prediction as a bug.
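The two rules can be sketched as predicates over simulation metadata. The field and function names below are illustrative stand-ins, not the platform's actual schema:

```python
from dataclasses import dataclass


@dataclass
class SimMetadata:
    # Illustrative fields; the real simulation definition is richer.
    fully_anisotropic: bool   # any material with a full permittivity tensor
    angle: float              # off-axis propagation angle, radians
    angle_rotation: bool      # geometry rotated back on-axis
    has_lossy_material: bool  # any material with nonzero Im(eps)
    has_pml: bool             # any boundary uses PML


def is_tensorial(m: SimMetadata) -> bool:
    """Tensorial if fully anisotropic, or off-axis without angle_rotation."""
    return m.fully_anisotropic or (m.angle != 0.0 and not m.angle_rotation)


def is_complex(m: SimMetadata) -> bool:
    """Complex if any lossy material, or any PML boundary (PML adds
    complex coordinate stretching even for purely real materials)."""
    return m.has_lossy_material or m.has_pml
```

Note the asymmetry: `is_complex` has no escape hatch, because missing PML under-predicts, while `is_tensorial` checks `angle_rotation` precisely to avoid over-predicting on rotated-back geometries.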
The Calibrated Model
Calibration Harness
We built a harness to run hundreds of controlled solves while sweeping:
- Grid points: 7k to 924k
- Modes: 1 to 10
- Frequencies: 1 to 5
- Pool workers: 1 to 5
- Solver backends: scipy and MUMPS/PETSc
- Material regimes: lossless, lossy, PML, tensorial
Each experiment runs in a fresh subprocess so we get a clean resident set size (RSS) measurement for that solve, rather than allocator leftovers from earlier work. We used a separate MPI scaling dataset for larger problems, spanning 366k to 8.9M grid points and 1 to 28 MPI ranks.
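A minimal sketch of that measurement pattern, assuming a fork-based start method and using ru_maxrss for peak RSS (kilobytes on Linux, bytes on macOS); the solve function is a stand-in:

```python
import multiprocessing as mp
import resource


def _measure(solve, args, out):
    # Runs in the fresh child: do the work, then report peak RSS.
    solve(*args)
    out.put(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)


def peak_rss_of(solve, *args):
    """Run `solve` in a fresh subprocess and return its peak RSS,
    uncontaminated by allocator state left over from earlier solves."""
    out = mp.Queue()
    p = mp.Process(target=_measure, args=(solve, args, out))
    p.start()
    rss = out.get()   # read before join so the queue never blocks the child
    p.join()
    return rss
```

Because the child exits after one solve, glibc arena retention from previous experiments cannot inflate the number.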
Model Structure
The final estimator has four parts:
1. Fit a base affine model. Start from scalar real cases and fit intercept, per-point slope, and mode-count slope separately for each solver backend.
2. Scale only the size-dependent terms. Apply the regime multiplier to the slopes, not the intercept, so small jobs do not inherit exaggerated startup overhead.
3. Account for pool copy-on-write behavior. Model the first worker separately, then add only the incremental per-worker overhead that survives copy-on-write sharing.
4. Add MPI overhead as a power law. For distributed solves, model extra rank cost separately and use different coefficients for real and complex matrices.
The base model is affine, fitted on scalar real cases:
Base affine fit
single_solve_gb = intercept + pts_coeff * (pts / 1000) + modes_coeff * num_modes
We fit separate coefficients for each solver backend. The intercept, roughly 1 GB, captures Python and library startup overhead.
Tensorial and complex problems need a regime scale factor on the memory estimate, but it should apply only to the size-dependent terms. The intercept is startup overhead; it does not change with matrix structure.
Scaling the whole estimate inflates predictions by an order of magnitude at small sizes, where the intercept dominates, without buying anything at large sizes, where the slope dominates. Scaling only the growing terms kept small solves tighter while preserving headroom where it mattered.
Regime-aware scaling
scaled_solve = intercept + (pts_coeff * pts / 1000 + modes_coeff * num_modes) * regime_scale
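With illustrative coefficients (the calibrated, backend-specific values are not reproduced here), the two formulas combine into a small function:

```python
def scaled_solve_gb(pts, num_modes, regime_scale,
                    intercept=1.0, pts_coeff=0.004, modes_coeff=0.05):
    """Single-solve estimate in GB. Only the size-dependent slopes are
    multiplied by the regime scale; the startup intercept is not.
    Coefficient values here are placeholders, not the fitted ones."""
    growth = pts_coeff * (pts / 1000) + modes_coeff * num_modes
    return intercept + growth * regime_scale
```

At small sizes the intercept dominates, so a tensorial job (regime_scale around 5) stays near the ~1 GB floor instead of being inflated fivefold.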
For frequency-parallel execution with a pool of N workers, total memory accounts for copy-on-write sharing:
Frequency-parallel pool model
first_worker = scaled_solve + accumulation * (freqs_per_worker - 1)
incremental = min(per_worker_overhead, first_worker)
total = first_worker + (N - 1) * incremental + safety_buffer
The accumulation term captures memory growth when one worker handles multiple frequencies sequentially.
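The pool model as code, again with placeholder constants rather than the calibrated values:

```python
def pool_total_gb(scaled_solve, n_workers, freqs_per_worker,
                  accumulation=0.2, per_worker_overhead=0.5,
                  safety_buffer=1.0):
    """Frequency-parallel pool estimate in GB. The incremental cost of
    workers 2..N is capped at the first worker's full footprint:
    copy-on-write sharing can only reduce a worker's cost, never raise
    it above a fully independent copy."""
    first_worker = scaled_solve + accumulation * (freqs_per_worker - 1)
    incremental = min(per_worker_overhead, first_worker)
    return first_worker + (n_workers - 1) * incremental + safety_buffer
```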
For MPI-parallel execution with K ranks, where frequencies are solved sequentially, the overhead follows a power law:
MPI overhead model
mpi_overhead = (alpha * pts_millions + beta) * (K - 1)^gamma
total = scaled_solve + mpi_overhead + accumulation * (nf - 1) + safety_buffer
There are separate (alpha, beta, gamma) coefficients for real versus complex matrices because the distributed solver expands complex matrices to real, quadrupling fill-in.
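The MPI branch looks similar; the (alpha, beta, gamma) values below are placeholders, not the fitted coefficients:

```python
# Illustrative per-regime coefficients; the calibrated values differ.
_MPI_COEFFS = {"real": (0.5, 0.1, 0.7), "complex": (2.0, 0.4, 0.7)}


def mpi_total_gb(scaled_solve, pts, n_ranks, nf, matrix_complex,
                 accumulation=0.2, safety_buffer=1.0):
    """MPI-parallel estimate in GB; frequencies are solved sequentially,
    so they contribute an accumulation term rather than extra workers."""
    alpha, beta, gamma = _MPI_COEFFS["complex" if matrix_complex else "real"]
    mpi_overhead = (alpha * pts / 1e6 + beta) * (n_ranks - 1) ** gamma
    return scaled_solve + mpi_overhead + accumulation * (nf - 1) + safety_buffer
```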
Allocator Retention in Forked Workers
When Python forks pool workers, the parent’s memory is shared copy-on-write, so the cost of adding a worker is much smaller than duplicating a full solve. But the calibration data exposed a second effect: glibc malloc arena accumulation was inflating RSS across sequential solves handled by the same worker. Each frequency solve allocates and frees large sparse arrays, yet the allocator often keeps that memory mapped instead of returning it to the OS. After a few solves, the worker can look materially larger than a fresh process doing the same job. That polluted the calibration data, especially when one worker processed multiple frequencies, and led to scale factors that were too pessimistic.
Setting maxtasksperchild=1 in the multiprocessing pool fixed it: each worker gets recycled after a single task. Re-running calibration produced cleaner data and tighter coefficients. Single-frequency or single-worker cases skip the pool entirely and run in-process.
Operational change
maxtasksperchild=1 became part of the runtime policy. Recycle each worker after one task so allocator arenas do not accumulate across sequential frequencies.
Indexing Limits in Sparse Factorization
Calibration also surfaced a failure that looked like a memory problem but wasn’t. At large grid counts in the tensorial regime, the solver crashed with a cryptic error even though the machine still had plenty of RAM left.
The cause was a 32-bit integer overflow in the sparse factorization's internal indexing. For a tensorial matrix at large N, fill-in grows fast enough with N that the factor can exceed the 2.15 billion non-zero limit of 32-bit indices. More RAM doesn't help.
We now validate the grid point count before submission and return a clear error, rather than letting it crash deep inside the factorization.
Hard limit
This is an indexing ceiling, not a RAM ceiling. When a solver fails at a suspiciously round problem size, check the index types before adding more RAM.
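A submission-time guard can be as simple as the sketch below; the fill-in estimate passed in is assumed to come from the calibrated model:

```python
INT32_MAX = 2**31 - 1  # ~2.15e9: non-zero ceiling with 32-bit indices


def check_indexable(estimated_nonzeros: int) -> None:
    """Fail fast at submission time, with a clear message, instead of
    crashing deep inside the factorization."""
    if estimated_nonzeros > INT32_MAX:
        raise ValueError(
            f"Estimated LU fill-in ({estimated_nonzeros:.2e} non-zeros) "
            f"exceeds the 32-bit index limit ({INT32_MAX}). Reduce the grid "
            "resolution or use a 64-bit-indexed solver build."
        )
```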
After calibration
The calibrated model keeps all 593 calibration runs on or below the diagonal while this 0–50 GB view still shows clean regime separation.
Calibration result
Every point sits on or below the diagonal. On this dataset, the estimator is conservative everywhere.
What We Learned
The calibrated model removed the cases where the first attempt would have OOMed. A few lessons from this work carry beyond mode solvers:
Separate fixed overhead from growth terms. If your model includes a constant startup cost, scaling the whole expression inflates predictions at small sizes without helping at large sizes. Scale only the terms that actually grow with the input.
Measure in subprocesses. Long-lived workers accumulate allocator artifacts. If you want clean peak RSS numbers, fork a child, do the work, and measure before it exits. Watch for allocator retention and use maxtasksperchild=1 where it matters.
Validate metadata-time predictions against runtime observations. It is easy to write a detection rule that works for the common case but misses edge cases like angle plus rotation or PML without lossy materials. Asserting that the frontend prediction never under-predicts the backend's actual matrix structure catches these before production.
Not every ceiling is a memory ceiling. Integer overflow in sparse indexing, 32-bit pointer limits, or process-count limits can all masquerade as OOM. When a solver fails at a suspiciously round problem size, check the index types before adding more RAM.
Get the memory estimate wrong, and you pay for it in retries, latency, or wasted RAM.