Enforcing physics in the VSSGP / neural-field training pipeline¶
Companion to SIREN ↔ RFF ↔ VSSGP — Bayesian Fourier features for geoscience interpolation. This note answers the question: given that we are training a neural field (SIREN) or a VSSGP and then evaluating it — including outside the training footprint — where does physical knowledge enter, and how do those entry points differ between the two methods?
The setup, written so it covers both methods:
| | Neural field (SIREN) | VSSGP |
|---|---|---|
| Function form | f_θ(x): MLP with sine activations | f(x) = wᵀϕ(x): RFF basis with frequencies ω ~ p(ω) |
| What is learned | Network weights θ via SGD on a loss | Weight posterior in closed form (linear case) or via SVI |
| Prediction at x* | f_θ(x*) — point estimate | Posterior mean + closed-form variance |
| Out-of-domain behaviour | Whatever the network extrapolates to (often poorly behaved) | Reverts smoothly to prior — variance grows, mean → 0; spectral structure of p(ω) remains |
Both share three axes where physics can enter the pipeline:
- Basis — the function class (RFF spectral prior; SIREN architecture).
- Data — input preprocessing, input embeddings, output preprocessing, output parameterisation.
- Loss — data fit, regularisers, PDE-residual / soft-physics terms.
These are nearly independent. You can intervene on any subset; effects on training and on out-of-domain behaviour are roughly additive. The rest of the note is one section per axis.
0. Method neutrality — what carries across, what does not¶
Before the axis-by-axis treatment: the two methods agree on what physics looks like at each axis, but disagree on what’s free vs. expensive at each axis.
| Move | NF (SIREN) | VSSGP |
|---|---|---|
| Spectral prior on energy distribution | ❌ not native — the heuristic ω₀ at the first layer is a degenerate proxy | ✅ explicit p(ω) (the headline of 00_siren_rff_vssgp.md) |
| Anisotropic / flow-aligned structure | ⚠️ approximate via input coordinate transform | ✅ Matérn-SPDE prior — exact |
| Periodic features for known harmonics | ✅ stack into input layer (positional encoding) | ✅ stack into basis ϕ |
| Predict scalar potential, derive constrained field via autodiff | ✅ jax.grad(NN_θ) | ✅ analytic differentiation of ϕ |
| Output transform (log, link function) | ✅ trivial | ✅ trivial — but breaks closed-form for non-Gaussian |
| Data-fit likelihood | Gaussian by default; Student-T / log-normal via custom loss | Gaussian closed-form; non-Gaussian via SVI / Laplace |
| PDE-residual / virtual-obs loss | ✅ standard PINN — joint SGD | ✅ closed-form for linear PDEs; SVI/Laplace otherwise |
| Learned operator inside the loss | ✅ native — NN-in-loss is the norm | ⚠️ awkward — breaks closed-form Bayes |
| Posterior over predictions (incl. OOD) | ❌ point estimate — needs ensembles or MCD for any uncertainty | ✅ closed-form Gaussian posterior at every point |
The pattern is consistent: VSSGP is naturally Bayesian and natively handles linear constraints in closed form; NF is naturally flexible and natively handles arbitrary nonlinear losses with a learned operator inside. For the kinds of physics the ocean reconstruction problem asks for — mostly linear (geostrophy, periodicity, advection-with-frozen-velocity, harmonic tides) plus a small amount of genuinely uncertain nonlinear physics (Chl-a sources) — VSSGP fills more of the space cleanly. NF earns its niche specifically where the operator itself needs to be learned.
1. AXIS — basis (function class)¶
The basis controls what shapes of function f is allowed to take, before any data is seen. For VSSGP this is p(ω) plus the number M of RFF features; for NF this is the architecture (depth, width, activation, ω₀ for SIREN, positional encoding).
1.1 Informative spectral prior — VSSGP-native, NF approximate¶
For VSSGP, this is the headline mechanism from 00_siren_rff_vssgp.md §3: pick p(ω) to match the known spectral structure of the field (mesoscale band for SSH, annual harmonic for SST, etc.). Spectral mass on the right frequencies → most of the M features are useful.
For SIREN, there is no p(ω), but there is the first-layer scaling ω₀. SIREN init is W₁ ~ U(−√(6/n), √(6/n)), then activations are sin(ω₀ W₁ x). The effective spectral measure is uniform on a box |ω| ≤ ω₀ √(6/n), dimensionwise. This is a bad approximation to a banded geophysical signal — see 00_siren_rff_vssgp.md §2 for the spectral-budget argument.
A cleaner SIREN cousin is Fourier feature networks (Tancik et al. 2020): replace the first layer’s learned W₁ with a fixed random matrix sampled from your chosen p(ω), then put the MLP on top of these features. This brings SIREN into the same spectral-prior framework as VSSGP at the architecture level — strictly recommended for any SIREN run targeting geoscience data.
Recipe.
```python
import equinox as eqx
import jax.numpy as jnp
from jax import Array

# VSSGP: draw RFF frequencies from an informative p(ω)
# (sample_spectral_mixture is a project helper, not a library call)
Omega = sample_spectral_mixture(M, components=miost_components, key=key)

# SIREN: fix the first-layer matrix to RFF frequencies, put an MLP on top
class FourierFeatureSIREN(eqx.Module):
    Omega_fixed: Array  # (M, d), drawn from p(ω) once, then frozen
    mlp: eqx.nn.MLP

    def __call__(self, x):
        z = self.Omega_fixed @ x
        gamma = jnp.concatenate([jnp.cos(z), jnp.sin(z)])  # (2M,) features
        return self.mlp(gamma)
```

Either way, the basis-axis physics is the same act: tell the model where energy lives in frequency space.
1.2 Anisotropic / flow-aligned basis¶
Genuine VSSGP/GP win — Matérn-SPDE prior with a position-dependent diffusivity tensor D(x) = R(θ_flow) diag(L_∥², L_⊥²) R(θ_flow)ᵀ encodes DYMOST-style anisotropy directly into the kernel (00_siren_rff_vssgp.md §3.0.2).
For NF, the only available approximation is an input coordinate transform that biases the network toward the right axes:
```python
import jax.numpy as jnp

def aligned_input(x, theta_flow_field, ratio_field):
    # flow_rotation / ratio_field are project helpers: the local flow-angle
    # rotation and the along-/across-flow length-scale ratio at x
    R = flow_rotation(x, theta_flow_field)
    L = jnp.diag(jnp.array([1.0, ratio_field(x)]))
    return R.T @ L @ R @ x  # stretch coordinates along the local flow axes
```

This is a soft architectural prior, not a hard basis constraint — it biases the implicit kernel of the NF without enforcing it. If anisotropy is critical (Gulf Stream, Kuroshio), VSSGP wins on this axis.
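On the VSSGP side, a minimal sketch of the flow-aligned diffusivity tensor the Matérn-SPDE prior consumes, assuming θ_flow and the length scales come from a mean-flow product (the SPDE discretisation itself is out of scope here):

```python
import jax.numpy as jnp

def diffusivity_tensor(theta_flow, L_par, L_perp):
    # D = R(θ) diag(L_∥², L_⊥²) R(θ)ᵀ: long correlation along the flow,
    # short across it; used as a position-dependent field in the SPDE prior
    c, s = jnp.cos(theta_flow), jnp.sin(theta_flow)
    R = jnp.array([[c, -s], [s, c]])
    return R @ jnp.diag(jnp.array([L_par ** 2, L_perp ** 2])) @ R.T
```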
1.3 Multi-scale / additive components¶
For both methods. VSSGP: stack Φ = [Φ_gyre | Φ_meso | Φ_sub] with per-component variance scaling — see 00_siren_rff_vssgp.md §4. NF: parallel branches (one per scale octave) with their own first-layer Ω_n, summed at the output. Both express the same physical content: “the field is a sum of components, each with a known scale band and amplitude.”
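A minimal sketch of the stacked basis on the VSSGP side, assuming one frequency matrix per band has already been drawn from its own p(ω); the per-band amplitudes then live in Σ_w (§3C):

```python
import jax.numpy as jnp

def multiscale_features(x, Omega_bands):
    # Φ = [Φ_gyre | Φ_meso | Φ_sub]: one RFF block per scale band
    def rff(Omega):
        z = Omega @ x
        return jnp.concatenate([jnp.cos(z), jnp.sin(z)])
    return jnp.concatenate([rff(Omega) for Omega in Omega_bands])
```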
2. AXIS — data (input and output)¶
The data axis covers five sub-mechanisms: (2A) preprocessing the inputs x, (2B) augmenting the inputs with engineered features, (2C) preprocessing the observed y, (2D) parameterising what f predicts (output structure), and (2E) augmenting the training set with derivative observations. 2A–2C are about what enters the model; 2D is about what comes out; 2E adds physically derived targets. All five work the same way for NF and VSSGP.
2A. Input preprocessing¶
Coordinate normalisation, lat-lon → 3D Cartesian on the sphere (avoids the longitudinal seam), time → days since epoch with a per-domain offset. Standard ML hygiene; no physics yet, but matters because both SIREN’s ω₀ and the VSSGP RFF frequencies are scale-sensitive.
For SSH specifically, working in f-plane regional patches (small enough that Coriolis f is approximately constant) lets you treat f/g as a constant — needed for the scalar-potential trick in §2D below.
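A minimal sketch of this preprocessing, assuming coordinates arrive as degrees and days:

```python
import jax.numpy as jnp

def preprocess_inputs(lat_deg, lon_deg, t_days, t0=0.0):
    # lat-lon → 3D Cartesian on the unit sphere: no seam at ±180° longitude
    lat, lon = jnp.deg2rad(lat_deg), jnp.deg2rad(lon_deg)
    xyz = jnp.array([jnp.cos(lat) * jnp.cos(lon),
                     jnp.cos(lat) * jnp.sin(lon),
                     jnp.sin(lat)])
    # time → days since a per-domain epoch t0, so frequencies stay comparable
    return jnp.concatenate([xyz, jnp.array([t_days - t0])])
```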
2B. Input embeddings — periodic features for known harmonics¶
Annual (T = 365.25 d), semiannual (182.6 d), diurnal (1.0 d), M2 tide (0.5175 d) are exact periodicities. Stack them into the input as

h(t) = [cos(2πt/Tₖ), sin(2πt/Tₖ)], Tₖ ∈ {365.25, 182.6, 1.0, 0.5175} days,

before any other processing. Both methods then have a hard subspace where the harmonic content lives, learnt via the standard data fit.
```python
import jax.numpy as jnp

def periodic_embedding(x, t):
    periods = jnp.array([365.25, 182.6, 1.0, 0.5175])  # days
    h = jnp.concatenate([jnp.cos(2 * jnp.pi * t / periods),
                         jnp.sin(2 * jnp.pi * t / periods)])
    return jnp.concatenate([x, h])
```

The recipe is identical for VSSGP (ϕ(periodic_embedding(x, t))) and SIREN (the input layer takes the embedding). It is a high-leverage move because it removes a chunk of the training residual that the basis would otherwise have to spend M features approximating.
2C. Output preprocessing — subtract known physics from y¶
Before the model sees the observations, subtract everything that is already known. For SSH (covered in 02 §2a):
| Subtraction | Source | Removes |
|---|---|---|
| Wet/dry tropospheric, ionospheric, sea-state bias | CMEMS L3 standard corrections | Path delay biases (~few cm each) |
| Solid Earth + pole tide, ocean tides FES2014/2022 | tide model | Earth-body + ocean tides (~30 cm + ~1 m peak) |
| Dynamic Atmospheric Correction (DAC) | MOG2D + IB | Storm-driven barotropic response (~10 cm at mid-lat) |
| Mean Dynamic Topography | CNES-CLS-22 | Time-mean SSH so you model the anomaly |
For SST: subtract climatology (Reynolds OISST mean) so the model sees an anomaly, not the absolute temperature. For SSS: subtract the multi-year mean. For Chl-a: log-transform first (the field is log-normal), then optionally subtract a seasonal climatology in log-space.
This is upstream of the model entirely and works identically for NF and VSSGP. Highest-leverage single move on the data axis — without it, the field’s mean dynamic structure dominates the spectral content and the prior’s mesoscale band gets drowned out.
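A minimal sketch of the per-variable output preprocessing, assuming the correction and climatology fields are already interpolated to the observation points (array names illustrative):

```python
import jax.numpy as jnp

def preprocess_ssh(ssh_raw, tides, dac, mdt):
    # L3 path-delay corrections assumed already applied upstream
    return ssh_raw - tides - dac - mdt          # SSH anomaly

def preprocess_sst(sst_raw, climatology):
    return sst_raw - climatology                # SST anomaly

def preprocess_chl(chl_raw, log_climatology=None):
    y = jnp.log(chl_raw)                        # Chl-a is log-normal
    return y if log_climatology is None else y - log_climatology
```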
2D. Output parameterisation — predict a scalar latent, derive the constrained field¶
The most under-used trick on the data axis. Instead of training f to predict the constrained vector field directly, train it to predict a scalar latent and apply a known operator to get the field. Constraints linear in f become automatic for every prediction — including out-of-domain ones.
The canonical example is divergence-free flow from a streamfunction ψ: set (u, v) = (−∂_y ψ, ∂_x ψ), so that ∇·(u, v) = −∂_x ∂_y ψ + ∂_y ∂_x ψ = 0 identically.
For SSH: train on η, then derive geostrophic currents post-hoc as (u_g, v_g) = (-(g/f)∂_y η, (g/f)∂_x η). Working with the scalar η is the scalar-latent trick — the resulting (u_g, v_g) is divergence-free for every η your model predicts.
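A minimal sketch of the NF-side recipe, assuming an f-plane patch (constant f) and a trained scalar network eta_fn taking metric coordinates (x, y):

```python
import jax

def geostrophic_currents(eta_fn, x, g=9.81, f=1e-4):
    # (u_g, v_g) = (−(g/f) ∂_y η, (g/f) ∂_x η): divergence-free by construction
    deta = jax.grad(eta_fn)(x)  # (∂_x η, ∂_y η)
    return -(g / f) * deta[1], (g / f) * deta[0]
```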
| Output structure | NF recipe | VSSGP recipe |
|---|---|---|
| Direct field | f_θ(x) predicts y | wᵀϕ(x) predicts y |
| Scalar potential, derive vector field | differentiate the scalar network via jax.grad | via analytic derivative of ϕ |
| Log-transformed field | predict log f, exponentiate at read-out (for Chl-a) | predict log f; loses Gaussian likelihood |
| Multi-output (joint SSH+SST) | Vector head; shared backbone | Block-diagonal Σ_w, independent posteriors per channel; couple via shared spectral prior |
Why this is so powerful for OOD prediction. The constraint holds for every prediction, including in regions far from training data — divergence-freeness, harmonic content, log-positivity all transfer to the extrapolation regime. This is qualitatively different from a soft-loss approach where the constraint is only enforced where collocation points are placed. For OOD prediction, output parameterisation is the only mechanism that reliably enforces a constraint everywhere.
2E. Data augmentation — derivative observations¶
Drifter / HF-radar / Doppler-altimetry velocities give (u_g, v_g) directly, which by geostrophy is (u_g, v_g) = (-(g/f) ∂_y η, (g/f) ∂_x η). Inverting, the derivative observations are ∂_y η = -(f/g) u_g and ∂_x η = (f/g) v_g — a linear functional of η. Add these to the training set as derivative observations, ordered to match what predict_eta_and_grad returns (here (η, ∂_y η, ∂_x η)):
```python
# y_augmented = [η_obs, ∂_y η at drifters, ∂_x η at drifters]
y_augmented = jnp.concatenate([eta_obs,
                               -(f / g) * u_drifter,
                               (f / g) * v_drifter])
loss_data = mse(predict_eta_and_grad(x, theta), y_augmented)
```

For NF: predict_eta_and_grad computes η from the network and uses jax.grad for the gradient predictions. For VSSGP: cross-covariance kernel rows K_{η, ∂η} = ∂_{x′} k(x, x′) via jax.grad(k) — same construction, closed-form.
Both methods get the same physical content from this data augmentation.
3. AXIS — loss¶
The loss combines the data-fit term with regularisers and (for soft physics) PDE-residual terms. Four sub-mechanisms.
3A. Data-fit likelihood¶
| Variable | Likelihood | NF form | VSSGP form |
|---|---|---|---|
| SSH | Gaussian (or correlated-noise for along-track) | MSE | Closed-form |
| SST | Student-T (cloud-contaminated IR) | Robust loss | SVI / Laplace |
| SSS | Student-T (RFI-contaminated SMOS) | Robust loss | SVI / Laplace |
| Chl-a (in log-space) | Gaussian on log y | MSE on log f | Closed-form on log f |
For VSSGP, anything beyond Gaussian breaks closed-form. The cost is real — Laplace / SVI introduces SGD-like training even on the GP side. For Gaussian-or-near-Gaussian variables (SSH, SST anomalies after climatology subtraction), VSSGP keeps its closed-form advantage; for genuinely heavy-tailed cases the methods are roughly on par cost-wise.
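On the NF side the robust data fit is a one-liner. A minimal sketch of the Student-T negative log-likelihood (up to an additive constant; ν and scale are hyperparameters to tune):

```python
import jax.numpy as jnp

def student_t_nll(residual, nu=4.0, scale=1.0):
    # heavy tails down-weight cloud / RFI outliers relative to an MSE fit
    return 0.5 * (nu + 1.0) * jnp.log1p((residual / scale) ** 2 / nu)
```

Summed over observations this replaces the MSE term; the rest of the pipeline is unchanged.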
3B. Soft-physics — PDE residual at collocation points¶
The PINN move, with the same algebraic content for both methods:

loss(θ) = loss_data(θ) + (1/σ_pde²) Σ_c ‖L f(x_c)‖², with {x_c} the collocation points.
In VSSGP language the second term is exactly the negative log-likelihood of virtual observations 0 = L f(x_c) + ε_c, with ε_c ~ N(0, σ_pde²). So it slots into the GP solve as additional rows in y and K. The cross-covariance machinery K_{f, Lf}(x, x') = L_{x'} k(x, x') is computed via jax.grad(k) — same primitive used for derivative observations in §2E.
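In weight-space (RFF) form the augmentation is literally row-stacking. A minimal sketch, using L = ∂/∂x₀ as a stand-in for any linear operator (phi is the feature map; names illustrative):

```python
import jax
import jax.numpy as jnp

def augment_with_virtual_obs(phi, X, y, X_c, sigma, sigma_pde):
    # virtual observations 0 = L f(x_c) + ε_c, ε_c ~ N(0, σ_pde²)
    Phi = jax.vmap(phi)(X)                               # (N, 2M) data rows
    LPhi = jax.vmap(jax.jacfwd(phi))(X_c)[..., 0]        # (N_c, 2M): ∂φ/∂x₀ rows
    Phi_aug = jnp.vstack([Phi, LPhi])
    y_aug = jnp.concatenate([y, jnp.zeros(len(X_c))])    # virtual targets are 0
    noise = jnp.concatenate([jnp.full(len(X), sigma ** 2),
                             jnp.full(len(X_c), sigma_pde ** 2)])
    return Phi_aug, y_aug, noise   # feed into the usual closed-form solve
```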
| Aspect | NF | VSSGP |
|---|---|---|
| Linear PDE (L f linear in f) | Joint SGD | Closed-form posterior in one solve |
| Nonlinear PDE | Joint SGD | Laplace / SVI |
| Per-collocation-point cost | One forward + autodiff per epoch | One row in augmented Gram |
| Sensitivity to σ_pde | Loss-landscape stiffness; well-known PINN failure mode [1] | Conditioning of augmented Gram; standard Tikhonov fixes |
Per-variable PDE residuals¶
| Variable | Operator | Linear in f? |
|---|---|---|
| SSH | Geostrophic balance: (u_g, v_g) = (−(g/f) ∂_y η, (g/f) ∂_x η) | Yes — but enforced hard via §2D, no loss term needed |
| SSH | Linearised PV: ∂_t q + β ∂_x ψ = 0, with q = ∇²ψ − ψ/L_d² and ψ = (g/f) η | Yes — soft constraint, closed-form for VSSGP |
| SSH | Full nonlinear PV: ∂_t q + J(ψ, q) = 0, with quadratic Jacobian J(ψ, q) = ∂_x ψ ∂_y q − ∂_y ψ ∂_x q | No — Laplace / SGD |
| SST/SSS/Chl-a | Advection-diffusion (frozen u): ∂_t T + u_g·∇T − κ∇²T = 0 | Yes (with frozen u) — soft constraint, two-stage |
| Chl-a | Add bio source-sink term S | No — needs learned S_θ (see §3D) |
The two-stage workflow for tracers: first reconstruct SSH (using §1 + §2 + §3A), freeze u_g, then for the tracer add the advection-diffusion residual as a soft constraint. Cost: one frozen velocity field, then a linear soft-physics solve.
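A minimal sketch of the residual at one collocation point, assuming x = (x, y, t) in consistent units and a frozen velocity field u_frozen from stage one (κ illustrative):

```python
import jax

def adv_diff_residual(T_fn, x, u_frozen, kappa=1e2):
    # ∂_t T + u·∇T − κ∇²T at x = (x, y, t)
    g = jax.grad(T_fn)(x)        # (T_x, T_y, T_t)
    H = jax.hessian(T_fn)(x)
    lap = H[0, 0] + H[1, 1]      # spatial Laplacian
    u, v = u_frozen(x)           # frozen geostrophic velocity
    return g[2] + u * g[0] + v * g[1] - kappa * lap
```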
3C. Regularisation — weight prior¶
For VSSGP this is built in (w ~ N(0, Σ_w)); for NF it has to be added as ‖θ‖² weight decay or a structured regulariser. Worth distinguishing:
| Regulariser | NF | VSSGP |
|---|---|---|
| Plain ‖θ‖² weight decay | Standard | Equivalent to isotropic Σ_w = σ_w² I |
| Per-component variance scale (different σ_w per scale octave) | Manual — pre-compute parameter groups, apply different decay to each | Closed-form via block-diagonal Σ_w (see 00_siren_rff_vssgp.md §4) |
| Spatially varying amplitude σ_w(x) from DUACS | Awkward — needs gating into the network | Closed-form (one line in Σ_w) |
| Sparsity (Laplace prior on weights) | L1 weight decay | Spike-and-slab; expensive |
VSSGP is more flexible at the regulariser level because the weight prior is an explicit object, not a single scalar λ.
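A minimal sketch of the per-component prior (band sizes and amplitudes illustrative; the amplitudes could come from DUACS variance maps):

```python
import jax.numpy as jnp
from jax.scipy.linalg import block_diag

def component_prior(M_per_band, sigma_per_band):
    # one isotropic block per scale band; 2M columns per band (cos + sin)
    blocks = [s ** 2 * jnp.eye(2 * M)
              for M, s in zip(M_per_band, sigma_per_band)]
    return block_diag(*blocks)

Sigma_w = component_prior([64, 256, 512], [0.30, 0.10, 0.02])
```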
3D. Learned operators — when L itself is the unknown¶
For Chl-a, the source term S(T, x, t) is biology-dependent and uncertain. Replace S with a small neural network S_θ(η, T, ∇T, x) and train end-to-end with the data fit + advection-diffusion residual:
```python
import jax.numpy as jnp

def loss(theta_field, theta_S, x_data, y_data, x_collocation):
    # Data fit (f_theta_field, advect_diffuse, S_theta_S are project
    # helpers closing over the respective parameters)
    L_data = jnp.mean((f_theta_field(x_data) - y_data) ** 2)
    # Soft physics with the learned source inside the residual
    residual = advect_diffuse(f_theta_field, x_collocation, u_g_frozen) \
               - S_theta_S(f_theta_field, x_collocation)
    L_phys = jnp.mean(residual ** 2) / sigma_pde ** 2
    # Regulariser on the learned operator
    L_reg = mu * jnp.sum(theta_S ** 2)
    return L_data + L_phys + L_reg
```

This is NF-native territory — putting an NN inside the loss is the standard pattern. The VSSGP equivalent (NN coefficients as marginal-likelihood hyperparameters) loses the closed-form posterior on θ_S and is awkward in practice. For the four target variables, only Chl-a clearly needs this; the other operators are well-known enough that fixed coefficients suffice.
4. Decision matrix — which axis × method for which constraint?¶
For each constraint candidate across SSH/SST/SSS/Chl-a:
| Constraint | Variable | Axis | Mechanism | NF | VSSGP |
|---|---|---|---|---|---|
| Mesoscale spectral content | SSH | Basis | Informative p(ω) (MIOST init) | ⚠️ Fourier-feature wrapper | ✅ native |
| Annual / diurnal harmonics | All | Data — input | Periodic embedding | ✅ | ✅ |
| Anisotropic flow-aligned structure | SSH | Basis | Matérn-SPDE / coord transform | ⚠️ approximate | ✅ exact |
| Subtract tides + DAC + MDT | SSH | Data — output preproc | Upstream subtraction | ✅ | ✅ |
| Subtract climatology | SST/SSS | Data — output preproc | Standardise to anomaly | ✅ | ✅ |
| Log-transform | Chl-a | Data — output param | Predict log f | ✅ | ⚠️ breaks Gaussian likelihood |
| Geostrophic currents from η | SSH | Data — output param | Scalar latent + jax.grad / analytic deriv | ✅ | ✅ |
| Drifter velocities as ∂η obs | SSH | Data — augmentation | Cross-cov kernel rows / autodiff | ✅ | ✅ closed-form |
| Linearised PV conservation | SSH | Loss | Soft / virtual obs | ✅ SGD | ✅ closed-form |
| Full nonlinear PV | SSH | Loss | Soft (nonlinear) | ✅ SGD | ⚠️ SVI/Laplace — same cost as SGD |
| Advection-diffusion (frozen u) | SST/SSS/Chl-a | Loss | Soft, two-stage | ✅ SGD | ✅ closed-form |
| Bio source/sink | Chl-a | Loss | Learned S_θ | ✅ native | ❌ awkward |
| Sub-grid closure | All | Loss | Learned operator | ✅ native | ❌ |
The pattern (consistent with §0):
- Linear constraints favour VSSGP — closed-form posterior, single solve, calibrated OOD uncertainty for free.
- Nonlinear-but-known constraints are roughly equal — both need iterative optimisation.
- Learned operators favour NF — the Chl-a source term is the only place this clearly matters across the four variables.
The most expensive axis (loss with learned operator) is also the one where the methods most clearly diverge.
5. Recommended sequence — what to implement first¶
For each variable, in order of leverage (highest first), with method and axis labelled:
SSH¶
1. [Data — output preproc] Subtract tides, DAC, MDT → work with the SSH anomaly. ½-day fix; without it nothing else is right. Both methods.
2. [Data — input] Periodic harmonics (annual, M2). One-line input embedding. Both methods.
3. [Data — output param] Predict scalar η, derive currents post-hoc. Free divergence-free constraint everywhere — including OOD. Both methods.
4. [Basis] Informative spectral prior (MIOST mixture) — VSSGP native. For NF, use a Fourier-feature first layer with the same p(ω).
5. [Data — augmentation] Drifter velocities as derivative obs. Cheap, high physical content. Both methods.
6. [Loss] Linearised PV soft constraint at collocation points. VSSGP wins here — closed-form, single solve. NF fallback: standard PINN.
7. [Loss] Full nonlinear PV. Optional, only after 1–6 plateau. Parity between methods.
SST¶
- [Data — output preproc] Subtract Reynolds OISST climatology.
- [Data — input] Periodic harmonics (annual, semiannual, diurnal).
- [Loss] Student-T likelihood for cloud-contaminated retrievals.
- [Basis] Spectral prior centred on frontal scales (~50 km).
- [Loss] Two-stage advection-diffusion with u_g from the SSH reconstruction. VSSGP closed-form for the linear PDE.
SSS¶
Same as SST minus the diurnal harmonic (negligible). Spatially-varying σ_pde near coasts where S is uncertain.
Chl-a¶
- [Data — output param] Log-transform. Predict log f. NF: trivial. VSSGP: breaks the Gaussian likelihood — needs SVI / Laplace.
- [Data — input] Periodic harmonic (annual; bloom seasonality).
- [Loss] Two-stage advection-diffusion without the source term. Captures bloom-edge propagation.
- [Loss] Learned source S_θ only if observed bloom edges are sharper than the reconstruction. NF wins here — putting an NN inside the loss is native.
Out-of-domain experiments¶
The most important test of the basis-axis informative prior is prediction outside the training footprint. Two recommended OOD splits:
- Spatial OOD — train on a subdomain (e.g. Z2 Gulf Stream), evaluate on a held-out adjacent region (Z3 \ Z2 in 00_siren_rff_vssgp.md §6.2). The prior’s spectral structure dominates beyond the training boundary.
- Temporal OOD — train on Jan–Sep, evaluate on Oct–Dec. Tests whether the seasonal embeddings + spectral prior generalise over time.
For VSSGP, report the posterior mean and per-point standard deviation — the latter should grow as you move out of domain, with the right spectral content (visible by drawing a few posterior samples and inspecting their power spectrum). For NF, report a deep-ensemble mean ± std (5–10 independently trained networks); SIREN extrapolation is famously unstable[2] and the ensemble spread is the cheapest signal of that instability.
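A minimal sketch of the deep-ensemble readout (apply_fn and the trained parameter list are assumed to come from the training loop):

```python
import jax
import jax.numpy as jnp

def ensemble_predict(apply_fn, params_list, x_eval):
    # K independently trained networks; the spread flags OOD instability
    preds = jnp.stack([jax.vmap(lambda x: apply_fn(p, x))(x_eval)
                       for p in params_list])   # (K, N)
    return preds.mean(axis=0), preds.std(axis=0)
```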
6. My take¶
The constraint-mechanism story for this project is three axes (basis / data / loss), not one, and they have very different leverage profiles for OOD prediction:
Axis ordering by OOD leverage: Data > Basis > Loss.
That ranking is specific to out-of-domain prediction and matters because it inverts the “in-domain” intuition.
- Data-axis interventions enforce constraints everywhere — including OOD. Predict η not (u, v), work in log-space for Chl-a, subtract MDT, embed periodic harmonics — these constraints hold in the extrapolation regime by construction. They are the highest-leverage moves and they are method-agnostic.
- Basis-axis interventions shape what the prior asserts in the gap — they don't enforce constraints, they bias which prior samples look plausible. This is where the informative p(ω) from 00_siren_rff_vssgp.md does its work, and it's most measurable in OOD regions where the prior dominates.
- Loss-axis interventions enforce constraints only at the collocation points used during training — they do not generalise to OOD by themselves. A PINN trained with PV-residual collocation in Region A does not produce PV-conserving extrapolations in Region B unless you place collocation points there. For OOD specifically, soft-physics losses are the weakest of the three axes.
This is the part of the picture I had wrong in earlier framings of the doc: I was treating soft physics as the workhorse and basis/data as additions. For an OOD-prediction project the order is the reverse.
Concrete recommendation for siren_vs_rff:
- Lock in all the data-axis moves first (preprocessing, periodic embeddings, scalar-latent parameterisation, derivative obs). These are the OOD-leverage moves and they apply identically to NF and VSSGP, so you can run them as the controlled-comparison baseline.
- Then ablate the basis-axis informative prior — VSSGP with MIOST init vs. VSSGP with uniform p(ω) vs. SIREN with default ω₀ vs. SIREN with a Fourier-feature wrapper using the same p(ω). This is the cleanest test of the 00_siren_rff_vssgp.md spectral-budget claim, and the OOD region is where the differences should be most visible.
- Add loss-axis soft physics last, with collocation points on a grid that covers the OOD region as well as the training region — otherwise the constraint won't help the metric you actually care about.
For the methods themselves: VSSGP wins on basis and on linear-loss; NF wins on Chl-a’s learned source term. Use both, additively, where each is strong. The siren_vs_rff comparison is most informative if it reports (method × axis × OOD-vs-in-domain) cells separately rather than a single headline RMSE — the headline number hides the place where the methods most differ.
References¶
[1] Krishnapriyan, A., Gholami, A., Zhe, S., Kirby, R., Mahoney, M. W. (2021). "Characterizing possible failure modes in physics-informed neural networks." NeurIPS 34.
[2] Lipman, Y., Aharoni, M. (2021). "On the failure modes of implicit neural representations under extrapolation." Workshop on Implicit Neural Representations.