
Enforcing physics in NF / VSSGP training — basis, data, loss


Companion to SIREN ↔ RFF ↔ VSSGP — Bayesian Fourier features for geoscience interpolation. This note answers the question: given that we are training a neural field (SIREN) or a VSSGP and then evaluating it — including outside the training footprint — where does physical knowledge enter, and how do those entry points differ between the two methods?

The setup, written so it covers both methods:

| | Neural field (SIREN) | VSSGP |
| --- | --- | --- |
| Function form | $f_\theta(x) = \mathrm{NN}_\theta(x)$ | $f(x) = \phi(x)^{\!\top} w$ |
| What is learned | Network weights $\theta$ via SGD on a loss | Weight posterior $p(w \mid y)$ in closed form (linear case) or via SVI |
| Prediction at $x^*$ | $f_\theta(x^*)$ — point estimate | $\mathbb{E}[f(x^*)] = \phi(x^*)^{\!\top} \mu_{\text{post}}$ + closed-form variance |
| Out-of-domain behaviour | Whatever the network extrapolates to (often poorly behaved) | Reverts smoothly to prior — variance grows, mean → 0; spectral structure of $p(\omega)$ remains |

Both share three axes where physics can enter the pipeline:

  1. Basis — the function class (RFF spectral prior; SIREN architecture).
  2. Data — input preprocessing, input embeddings, output preprocessing, output parameterisation.
  3. Loss — data fit, regularisers, PDE-residual / soft-physics terms.

These are nearly independent. You can intervene on any subset; effects on training and on out-of-domain behaviour are roughly additive. The rest of the note is one section per axis.


0. Method neutrality — what carries across, what does not

Before the axis-by-axis treatment: the two methods agree on what physics looks like on each axis, but they disagree about what is free versus what is expensive.

| Move | NF (SIREN) | VSSGP |
| --- | --- | --- |
| Spectral prior on energy distribution | ❌ nothing native — heuristic ω₀ at the first layer is a degenerate proxy | ✅ explicit $p(\omega)$ (the headline of 00_siren_rff_vssgp.md) |
| Anisotropic / flow-aligned structure | ⚠️ approximate via input coordinate transform | ✅ Matérn-SPDE prior — exact |
| Periodic features for known harmonics | ✅ stack into input layer (positional encoding) | ✅ stack into basis $\phi$ |
| Predict scalar potential, derive constrained field | via autodiff — `jax.grad(NN_θ)` | ✅ analytic differentiation of $\phi$ |
| Output transform (log, link function) | ✅ trivial | ✅ trivial — but breaks closed-form for non-Gaussian |
| Data-fit likelihood | Gaussian by default; Student-T / log-normal via custom loss | Gaussian closed-form; non-Gaussian via SVI / Laplace |
| PDE-residual / virtual-obs loss | ✅ standard PINN — joint SGD | ✅ closed-form for linear PDEs; SVI/Laplace otherwise |
| Learned operator inside the loss | ✅ native — NN-in-loss is the norm | ⚠️ awkward — breaks closed-form Bayes |
| Posterior over predictions (incl. OOD) | ❌ point estimate — needs ensembles or MC dropout for any uncertainty | ✅ closed-form Gaussian posterior at every point |

The pattern is consistent: VSSGP is naturally Bayesian and natively handles linear constraints in closed form; NF is naturally flexible and natively handles arbitrary nonlinear losses with a learned operator inside. For the kinds of physics the ocean reconstruction problem asks for — mostly linear (geostrophy, periodicity, advection-with-frozen-velocity, harmonic tides) plus a small amount of genuinely uncertain nonlinear physics (Chl-a sources) — VSSGP fills more of the space cleanly. NF earns its niche specifically where the operator itself needs to be learned.


1. AXIS — basis (function class)

The basis controls what shapes the function f is allowed to take before any data is seen. For VSSGP this is p(ω) plus the input dimensionality of the RFF features; for NF it is the architecture (depth, width, activation, ω₀ for SIREN, positional encoding).

1.1 Informative spectral prior — VSSGP-native, NF approximate

For VSSGP, this is the headline mechanism from 00_siren_rff_vssgp.md §3: pick p(ω) to match the known spectral structure of the field (mesoscale band for SSH, annual harmonic for SST, etc.). Spectral mass on the right frequencies → most of the M features are useful.

For SIREN, there is no p(ω), but there is the first-layer scaling ω₀. SIREN init is W₁ ~ U(−√(6/n), √(6/n)), then activations are sin(ω₀ W₁ x). The effective spectral measure is uniform on a box |ω| ≤ ω₀ √(6/n), dimensionwise. This is a bad approximation to a banded geophysical signal — see 00_siren_rff_vssgp.md §2 for the spectral-budget argument.
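To make the box concrete, here is a minimal numeric sketch of the implied per-dimension frequency support, using the formula above with assumed values n = 256 and ω₀ = 30 (SIREN's common default); all names are illustrative:

```python
import numpy as np

# Assumed values: hidden width n = 256, SIREN's common default omega_0 = 30
n, omega_0 = 256, 30.0

# Per-dimension support of the effective spectral measure: |omega| <= omega_0 * sqrt(6/n)
omega_max = omega_0 * np.sqrt(6.0 / n)

# Sample a first layer the SIREN way and scale by omega_0: every effective
# frequency lands inside the box, with mass spread uniformly across it —
# not concentrated in a mesoscale band.
rng = np.random.default_rng(0)
W1 = rng.uniform(-np.sqrt(6.0 / n), np.sqrt(6.0 / n), size=(n, 2))
omega_eff = omega_0 * W1
```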

A cleaner SIREN cousin is Fourier feature networks (Tancik et al. 2020): replace the first layer’s learned W₁ with a fixed random matrix sampled from your chosen p(ω), then put the MLP on top of these features. This brings SIREN into the same spectral-prior framework as VSSGP at the architecture level — strictly recommended for any SIREN run targeting geoscience data.

Recipe.

```python
# VSSGP: draw RFF frequencies from an informative p(ω)
Omega = sample_spectral_mixture(M, components=miost_components, key=key)

# SIREN: fix the first-layer matrix to RFF frequencies, put an MLP on top
class FourierFeatureSIREN(eqx.Module):
    Omega_fixed: Array     # (M, d), drawn from p(ω) once, frozen
    mlp: eqx.nn.MLP

    def __call__(self, x):
        # Fourier features γ(x) = [cos(Ωx), sin(Ωx)] from the fixed spectral draw
        gamma = jnp.concatenate([jnp.cos(self.Omega_fixed @ x),
                                 jnp.sin(self.Omega_fixed @ x)])
        return self.mlp(gamma)
```

Either way, the basis-axis physics is the same act: tell the model where energy lives in frequency space.

1.2 Anisotropic / flow-aligned basis

Genuine VSSGP/GP win — Matérn-SPDE prior with a position-dependent diffusivity tensor D(x) = R(θ_flow) diag(L_∥², L_⊥²) R(θ_flow)ᵀ encodes DYMOST-style anisotropy directly into the kernel (00_siren_rff_vssgp.md §3.0.2).

For NF, the only available approximation is an input coordinate transform that biases the network toward the right axes:

```python
def aligned_input(x, theta_flow_field, ratio_field):
    # Rotate into the flow frame, stretch across-flow, rotate back:
    # x' = Rᵀ L R x with L = diag(1, L_∥/L_⊥)
    R = flow_rotation(x, theta_flow_field)
    L = jnp.diag(jnp.array([1.0, ratio_field(x)]))
    return R.T @ L @ R @ x
```

This is a soft architectural prior, not a hard basis constraint — it biases the implicit kernel of the NF without enforcing it. If anisotropy is critical (Gulf Stream, Kuroshio), VSSGP wins on this axis.

1.3 Multi-scale / additive components

For both methods. VSSGP: stack Φ = [Φ_gyre | Φ_meso | Φ_sub] with per-component variance scaling — see 00_siren_rff_vssgp.md §4. NF: parallel branches (one per scale octave) with their own first-layer Ω_n, summed at the output. Both express the same physical content: “the field is a sum of components, each with a known scale band and amplitude.”
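A weight-space sketch of the stacked basis; the helper `multiscale_rff` and all scale bands and amplitudes below are illustrative assumptions, not values from 00_siren_rff_vssgp.md:

```python
import numpy as np

def multiscale_rff(x, omegas, phases, sigmas):
    """Stack per-octave RFF blocks Phi = [Phi_gyre | Phi_meso | Phi_sub],
    each scaled by its component amplitude sigma_c."""
    blocks = []
    for Omega, phi0, sig in zip(omegas, phases, sigmas):
        M = Omega.shape[0]
        blocks.append(sig * np.sqrt(2.0 / M) * np.cos(x @ Omega.T + phi0))
    return np.concatenate(blocks, axis=-1)

rng = np.random.default_rng(0)
# Three assumed scale bands: gyre (~2000 km), mesoscale (~150 km), submesoscale (~20 km)
lengthscales = [2000.0, 150.0, 20.0]
omegas = [rng.normal(0.0, 1.0 / ell, size=(64, 2)) for ell in lengthscales]
phases = [rng.uniform(0.0, 2 * np.pi, size=64) for _ in lengthscales]
sigmas = [0.3, 0.1, 0.02]   # assumed per-component amplitudes (metres)

Phi = multiscale_rff(rng.normal(size=(5, 2)), omegas, phases, sigmas)
# 5 inputs -> 3 blocks of 64 features = 192 stacked columns
```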


2. AXIS — data (input and output)

The data axis covers four sub-mechanisms: (2A) preprocessing the inputs x, (2B) augmenting the inputs with engineered features, (2C) preprocessing the observed y, and (2D) parameterising what f predicts (output structure). The first three are about what enters the model; the fourth is about what comes out. All four work the same way for NF and VSSGP.

2A. Input preprocessing

Coordinate normalisation, lat-lon → 3D Cartesian on the sphere (avoids the longitudinal seam), time → days since epoch with a per-domain offset. Standard ML hygiene; no physics yet, but matters because both SIREN’s ω₀ and the VSSGP RFF frequencies are scale-sensitive.

For SSH specifically, working in f-plane regional patches (small enough that Coriolis f is approximately constant) lets you treat f/g as a constant — needed for the scalar-potential trick in §2D below.
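A minimal sketch of the input pipeline; the function name and the days-since-epoch convention are assumptions:

```python
import numpy as np

def preprocess_inputs(lat_deg, lon_deg, t_days, t0=0.0):
    """Lat-lon -> 3D unit Cartesian (no longitudinal seam) plus shifted time."""
    lat, lon = np.deg2rad(lat_deg), np.deg2rad(lon_deg)
    xyz = np.stack([np.cos(lat) * np.cos(lon),
                    np.cos(lat) * np.sin(lon),
                    np.sin(lat)], axis=-1)
    return np.concatenate([xyz, (t_days - t0)[..., None]], axis=-1)

# Continuity across the dateline: lon = 179.9 and -179.9 map to nearby points,
# which lat-lon inputs fed directly to the model would not.
a = preprocess_inputs(np.array([10.0]), np.array([179.9]), np.array([0.0]))
b = preprocess_inputs(np.array([10.0]), np.array([-179.9]), np.array([0.0]))
```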

2B. Input embeddings — periodic features for known harmonics

Annual (T = 365.25 d), semiannual (182.6 d), diurnal (1.0 d), M2 tide (0.5175 d) are exact periodicities. Stack them into the input as

$$\gamma(t) = \bigl[\cos(2\pi t/T_k),\ \sin(2\pi t/T_k)\bigr]_{k=1}^{K}$$

before any other processing. Both methods then have a hard subspace where the harmonic content lives, learnt via the standard data fit.

```python
def periodic_embedding(x, t):
    periods = jnp.array([365.25, 182.6, 1.0, 0.5175])     # days
    h = jnp.concatenate([jnp.cos(2 * jnp.pi * t / periods),
                         jnp.sin(2 * jnp.pi * t / periods)])
    return jnp.concatenate([x, h])
```

Recipe is identical for VSSGP (φ(periodic_embedding(x, t))) and SIREN (input layer takes the embedding). High-leverage move because it removes a chunk of the training residual that the basis would otherwise have to spend M features approximating.

2C. Output preprocessing — subtract known physics from y

Before the model sees the observations, subtract everything that is already known. For SSH (covered in 02 §2a):

| Subtraction | Source | Removes |
| --- | --- | --- |
| Wet/dry tropospheric, ionospheric, sea-state bias | CMEMS L3 standard corrections | Path-delay biases (~few cm each) |
| Solid Earth + pole tide, ocean tides | FES2014/2022 tide model | Earth-body + ocean tides (~30 cm + ~1 m peak) |
| Dynamic Atmospheric Correction (DAC) | MOG2D + IB | Storm-driven barotropic response (~10 cm at mid-latitudes) |
| Mean Dynamic Topography | CNES-CLS-22 | Time-mean SSH, so the model sees the anomaly $\eta_a$ |

For SST: subtract climatology (Reynolds OISST mean) so the model sees an anomaly, not the absolute temperature. For SSS: subtract the multi-year mean. For Chl-a: log-transform first (the field is log-normal), then optionally subtract a seasonal climatology in log-space.

This is upstream of the model entirely and works identically for NF and VSSGP. Highest-leverage single move on the data axis — without it, the field’s mean dynamic structure dominates the spectral content and the prior’s mesoscale band gets drowned out.
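Operationally the subtraction is a one-liner upstream of training; variable names are illustrative, and the path-delay corrections are assumed already applied in the L3 product:

```python
import numpy as np

def ssh_anomaly(ssh_l3, ocean_tide, dac, mdt):
    """eta_a = SSH - ocean tide - DAC - MDT, all on the same grid/track."""
    return ssh_l3 - ocean_tide - dac - mdt

def chla_target(chl):
    """Chl-a is ~log-normal: model log(chl); a seasonal climatology can
    optionally be subtracted in log-space afterwards."""
    return np.log(chl)
```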

2D. Output parameterisation — predict a scalar latent, derive the constrained field

The most under-used trick on the data axis. Instead of training f to predict the constrained vector field directly, train it to predict a scalar latent and apply a known operator to get the field. Constraints linear in f become automatic for every prediction — including out-of-domain ones.

The canonical example is divergence-free flow from a streamfunction:

$$\psi_\theta(x) \;\xrightarrow{\ \text{predict}\ }\; (u, v) = (-\partial_y \psi_\theta,\ +\partial_x \psi_\theta)$$

For SSH: train on η, then derive geostrophic currents post-hoc as (u_g, v_g) = (-(g/f)∂_y η, (g/f)∂_x η). Working with the scalar η is the scalar-latent trick — the resulting (u_g, v_g) is divergence-free for every η your model predicts.

| Output structure | NF recipe | VSSGP recipe |
| --- | --- | --- |
| Direct field $f(x)$ | $f = \mathrm{NN}_\theta(x)$ | $f = \phi(x)^{\!\top} w$ |
| Scalar potential, derive vector field | $\psi = \mathrm{NN}_\theta(x)$; $\mathbf{v} = \nabla^\perp \psi$ via `jax.grad` | $\psi = \phi(x)^{\!\top} w$; $\mathbf{v}$ via analytic derivative of $\phi$ |
| Log-transformed field | $f = \exp(\mathrm{NN}_\theta(x))$ for Chl-a | $f = \exp(\phi(x)^{\!\top} w)$; loses Gaussian likelihood |
| Multi-output (joint SSH+SST) | Vector head; shared backbone | Block-diagonal $w$, independent posteriors per channel; couple via shared spectral prior |

Why this is so powerful for OOD prediction. The constraint holds for every prediction, including in regions far from training data — divergence-freeness, harmonic content, log-positivity all transfer to the extrapolation regime. This is qualitatively different from a soft-loss approach where the constraint is only enforced where collocation points are placed. For OOD prediction, output parameterisation is the only mechanism that reliably enforces a constraint everywhere.
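A sketch of the divergence-free guarantee, with a stand-in scalar field (any differentiable η works; `g_over_f` folds the f-plane constant, and all names are hypothetical):

```python
import jax
import jax.numpy as jnp

def eta(params, x):
    """Stand-in scalar SSH field; in practice NN_theta(x) or phi(x)^T w."""
    return jnp.sin(params[0] * x[0]) * jnp.cos(params[1] * x[1])

def geostrophic_uv(params, x, g_over_f=1.0):
    """(u_g, v_g) = (-(g/f) d_y eta, (g/f) d_x eta) via autodiff."""
    d_eta = jax.grad(eta, argnums=1)(params, x)
    return jnp.array([-g_over_f * d_eta[1], g_over_f * d_eta[0]])

def divergence(params, x):
    """du/dx + dv/dy — zero by construction for any eta."""
    J = jax.jacfwd(geostrophic_uv, argnums=1)(params, x)
    return J[0, 0] + J[1, 1]

p = jnp.array([2.0, 3.0])
div = divergence(p, jnp.array([0.3, 0.7]))   # ~0 up to float error
```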

2E. Data augmentation — derivative observations

Drifter / HF-radar / Doppler-altimetry velocities give (u_g, v_g) directly, which by geostrophy is (u_g, v_g) = (-(g/f) ∂_y η, (g/f) ∂_x η). Inverting, the derivative observations are ∂_y η = -(f/g) u_g and ∂_x η = (f/g) v_g — a linear functional of η. Add these to the training set as derivative observations, ordered to match what predict_eta_and_grad returns (here (η, ∂_y η, ∂_x η)):

```python
# y_augmented = [η_obs, ∂_y η at drifters, ∂_x η at drifters]
y_augmented = jnp.concatenate([eta_obs, -(f / g) * u_drifter, (f / g) * v_drifter])
loss_data = mse(predict_eta_and_grad(x, theta), y_augmented)
```

For NF: predict_eta_and_grad computes η from the network and uses jax.grad to get gradient observations. For VSSGP: cross-covariance kernel rows K_{η, ∂η} = ∂_{x'} k(x, x') via jax.grad(k) — same construction, closed-form.

Both methods get the same physical content from this data augmentation.
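A sketch of `predict_eta_and_grad` for the NF side, with a toy stand-in for NN_θ; the (η, ∂_y η, ∂_x η) ordering matches `y_augmented` above:

```python
import jax
import jax.numpy as jnp

def eta_model(theta, x):
    """Placeholder for NN_theta(x); any differentiable scalar model works."""
    return jnp.tanh(x @ theta)

def predict_eta_and_grad(theta, x):
    """Return (eta, d_y eta, d_x eta) at one point, with x = (x, y)."""
    e = eta_model(theta, x)
    g = jax.grad(eta_model, argnums=1)(theta, x)
    return jnp.array([e, g[1], g[0]])

theta = jnp.array([0.5, -1.0])
out = jax.vmap(lambda x: predict_eta_and_grad(theta, x))(jnp.zeros((4, 2)))
# At x = 0: eta = tanh(0) = 0 and grad = theta, so each row is [0, -1.0, 0.5]
```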


3. AXIS — loss

The loss combines the data-fit term with regularisers and (for soft-physics) PDE-residual terms. Three sub-mechanisms.

3A. Data-fit likelihood

| Variable | Likelihood | NF form | VSSGP form |
| --- | --- | --- | --- |
| SSH | Gaussian (or correlated-noise for along-track) | MSE | Closed-form |
| SST | Student-T (cloud-contaminated IR) | Robust loss | SVI / Laplace |
| SSS | Student-T (RFI-contaminated SMOS) | Robust loss | SVI / Laplace |
| Chl-a (in log-space) | Gaussian on $\log y$ | MSE on $\log f$ | Closed-form on $\log f$ |

For VSSGP, anything beyond Gaussian breaks closed-form. The cost is real — Laplace / SVI introduces SGD-like training even on the GP side. For Gaussian-or-near-Gaussian variables (SSH, SST anomalies after climatology subtraction), VSSGP keeps its closed-form advantage; for genuinely heavy-tailed cases the methods are roughly on par cost-wise.

3B. Soft-physics — PDE residual at collocation points

The PINN move, with the same algebraic content for both methods:

$$\mathcal{L}(\theta) = \underbrace{\sum_{i=1}^{N} \tfrac{(y_i - f_\theta(x_i))^2}{\sigma_{\text{obs}}^2}}_{\text{data fit}} \;+\; \underbrace{\sum_{c=1}^{m_c} \tfrac{\|L f_\theta(x_c)\|^2}{\sigma_{\text{pde}}^2}}_{\text{soft physics at collocation pts}} \;+\; \underbrace{\mathcal{R}(\theta)}_{\text{regulariser}}.$$

In VSSGP language the second term is exactly the negative log-likelihood of virtual observations 0 = L f(x_c) + ε_c, with ε_c ~ N(0, σ_pde²). So it slots into the GP solve as additional rows in y and K. The cross-covariance machinery K_{f, Lf}(x, x') = L_{x'} k(x, x') is computed via jax.grad(k) — same primitive used for derivative observations in §2E.
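A one-dimensional weight-space sketch of the virtual-observation solve, with L = d/dx and assumed noise scales σ_obs = 0.1, σ_pde = 0.5 (all names and numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, C = 40, 30, 20
Omega = rng.normal(0.0, 2.0, size=(M, 1))

def phi(x):
    """RFF features, shape (n, M)."""
    return np.sqrt(2.0 / M) * np.cos(x @ Omega.T)

def dphi(x):
    """Analytic derivative of each feature w.r.t. x (here playing L phi)."""
    return -np.sqrt(2.0 / M) * np.sin(x @ Omega.T) * Omega.T

# Data rows plus virtual-observation rows 0 = L f(x_c) + eps_c
X = rng.uniform(-1, 1, size=(N, 1))
y = np.sin(3 * X[:, 0])
Xc = rng.uniform(-1, 1, size=(C, 1))
A = np.vstack([phi(X) / 0.1, dphi(Xc) / 0.5])     # rows scaled by 1/sigma
b = np.concatenate([y / 0.1, np.zeros(C)])

# Closed-form Gaussian posterior mean for w ~ N(0, I): one linear solve
mu_w = np.linalg.solve(A.T @ A + np.eye(M), A.T @ b)
```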

| Aspect | NF | VSSGP |
| --- | --- | --- |
| Linear PDE ($Lf$ linear in $f$) | Joint SGD | Closed-form posterior in one solve |
| Nonlinear PDE | Joint SGD | Laplace / SVI |
| Per-collocation-point cost | One forward + autodiff pass per epoch | One row in the augmented Gram matrix |
| Sensitivity to $\sigma_{\text{pde}}$ | Loss-landscape stiffness; a well-known PINN failure mode[1] | Conditioning of the augmented Gram matrix; standard Tikhonov fixes |

Per-variable PDE residuals:

| Variable | Operator | Linear in $f$? |
| --- | --- | --- |
| SSH | Geostrophic balance: $u_g = -(g/f)\partial_y\eta$ | Yes — but enforced hard via §2D, no loss term needed |
| SSH | Linearised PV: $\partial_t(\nabla^2\eta - \eta/L_R^2) + \beta v = 0$ | Yes — soft constraint, closed-form for VSSGP |
| SSH | Full nonlinear PV: $Dq/Dt = 0$ with quadratic Jacobian | No — Laplace / SGD |
| SST/SSS/Chl-a | Advection-diffusion (frozen $\mathbf{u}$): $\partial_t T + \mathbf{u}\cdot\nabla T - \kappa\nabla^2 T = S$ | Yes (with frozen $\mathbf{u}$) — soft constraint, two-stage |
| Chl-a | Add bio source-sink term $S$ | No — needs learned $S_\theta$ (see §3D) |

The two-stage workflow for tracers: first reconstruct SSH (using §1 + §2 + §3A), freeze u_g, then for the tracer add the advection-diffusion residual as a soft constraint. Cost: one frozen velocity field, then a linear soft-physics solve.
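A sketch of the frozen-velocity residual at a single collocation point (spatial terms only; the tracer model, velocity, and κ are stand-ins):

```python
import jax
import jax.numpy as jnp

def T_field(theta, x):
    """Placeholder tracer model T_theta(x, y); scalar output."""
    return jnp.sin(theta[0] * x[0] + theta[1] * x[1])

def adv_diff_residual(theta, x, u_frozen, kappa=1e-3):
    """u . grad(T) - kappa * laplacian(T) at one collocation point
    (time derivative omitted in this spatial-only sketch)."""
    g = jax.grad(T_field, argnums=1)(theta, x)
    H = jax.hessian(T_field, argnums=1)(theta, x)
    return u_frozen @ g - kappa * jnp.trace(H)

theta = jnp.array([1.0, 2.0])
r = adv_diff_residual(theta, jnp.array([0.2, 0.1]), jnp.array([0.5, -0.3]))
```

In the two-stage workflow `u_frozen` would come from the SSH reconstruction's geostrophic currents, evaluated once and held fixed.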

3C. Regularisation — weight prior

For VSSGP this is built in (w ~ N(0, Σ_w)); for NF it has to be added as ‖θ‖² weight decay or a structured regulariser. Worth distinguishing:

| Regulariser | NF | VSSGP |
| --- | --- | --- |
| Plain $\lVert\theta\rVert^2$ weight decay | Standard | Equivalent to isotropic $\Sigma_w = \sigma_w^2 I$ |
| Per-component variance scale (different $\sigma_w$ per scale octave) | Manual — pre-compute parameter groups, apply different decay to each | Closed-form via block-diagonal $\Sigma_w$ (see 00_siren_rff_vssgp.md §4) |
| Spatially varying amplitude $\sigma_w(x)$ from DUACS | Awkward — needs gating into the network | Closed-form (one line in $\Sigma_w$) |
| Sparsity (Laplace prior on weights) | L1 weight decay | Spike-and-slab; expensive |
VSSGP is more flexible at the regulariser level because the weight prior is an explicit object, not a single scalar λ.
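For concreteness, the per-octave prior amounts to per-block variances entering the closed-form solve; block sizes and amplitudes below are assumptions:

```python
import numpy as np

# Three scale octaves, each with its own amplitude sigma_c (assumed values)
M_blocks = [64, 64, 64]
sigma_blocks = [0.3, 0.1, 0.02]

# Block-diagonal Sigma_w collapses to a diagonal when each block is isotropic
Sigma_w_diag = np.concatenate([s**2 * np.ones(m)
                               for m, s in zip(M_blocks, sigma_blocks)])

# It enters the posterior solve as Sigma_w^{-1}:
#   (Phi^T Phi / sigma_n^2 + Sigma_w^{-1}) mu_post = Phi^T y / sigma_n^2
Sigma_w_inv_diag = 1.0 / Sigma_w_diag
```

The NF equivalent is manual: group parameters by octave and give each group its own weight-decay coefficient $1/\sigma_c^2$.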

3D. Learned operators — when L itself is the unknown

For Chl-a, the source term S(T, x, t) is biology-dependent and uncertain. Replace S with a small neural network S_θ(η, T, ∇T, x) and train end-to-end with the data fit + advection-diffusion residual:

```python
def loss(theta_field, theta_S, x_data, y_data, x_collocation):
    # Data fit
    L_data = jnp.mean((f_theta_field(x_data) - y_data) ** 2)
    # Soft physics with learned source
    residual = advect_diffuse(f_theta_field, x_collocation, u_g_frozen) \
               - S_theta_S(f_theta_field, x_collocation)
    L_phys = jnp.mean(residual ** 2) / sigma_pde**2
    # Regulariser on the learned operator
    L_reg = mu * jnp.sum(theta_S ** 2)
    return L_data + L_phys + L_reg
```

This is NF-native territory — putting an NN inside the loss is the standard pattern. The VSSGP equivalent (NN coefficients as marginal-likelihood hyperparameters) loses the closed-form posterior on θ_S and is awkward in practice. For the four target variables, only Chl-a clearly needs this; the other operators are well-known enough that fixed coefficients suffice.


4. Decision matrix — which axis × method for which constraint?

For each constraint candidate across SSH/SST/SSS/Chl-a:

| Constraint | Variable | Axis | Mechanism | NF | VSSGP |
| --- | --- | --- | --- | --- | --- |
| Mesoscale spectral content | SSH | Basis | Informative $p(\omega)$ (MIOST init) | ⚠️ Fourier-feature wrapper | ✅ native |
| Annual / diurnal harmonics | All | Data — input | Periodic embedding | ✅ | ✅ |
| Anisotropic flow-aligned structure | SSH | Basis | Matérn-SPDE / coord transform | ⚠️ approximate | ✅ exact |
| Subtract tides + DAC + MDT | SSH | Data — output preproc | Upstream subtraction | ✅ | ✅ |
| Subtract climatology | SST/SSS | Data — output preproc | Standardise to anomaly | ✅ | ✅ |
| Log-transform | Chl-a | Data — output param | Predict $\log f$ | ✅ trivial | ⚠️ breaks Gaussian likelihood |
| Geostrophic currents from $\eta$ | SSH | Data — output param | Scalar latent + `jax.grad` / analytic deriv | ✅ | ✅ |
| Drifter velocities as $\partial\eta$ obs | SSH | Data — augmentation | Cross-cov kernel rows / autodiff | ✅ | ✅ closed-form |
| Linearised PV conservation | SSH | Loss | Soft / virtual obs | ✅ SGD | ✅ closed-form |
| Full nonlinear PV | SSH | Loss | Soft (nonlinear) | ✅ SGD | ⚠️ SVI/Laplace — same cost as SGD |
| Advection-diffusion (frozen $\mathbf{u}$) | SST/SSS/Chl-a | Loss | Soft, two-stage | ✅ SGD | ✅ closed-form |
| Bio source/sink | Chl-a | Loss | Learned $S_\theta$ | ✅ native | ❌ awkward |
| Sub-grid closure | All | Loss | Learned operator | ✅ native | ❌ awkward |

The pattern (consistent with §0):

The most expensive axis (loss with learned operator) is also the one where the methods most clearly diverge.


5. Per-variable priorities

For each variable, in order of leverage (highest first), with method and axis labelled:

SSH

  1. [Data — output preproc] Subtract tides, DAC, MDT → work with the anomaly $\eta_a$. ½-day fix; without it nothing else is right. Both methods.
  2. [Data — input] Periodic harmonics (annual, M2). One-line input embedding. Both methods.
  3. [Data — output param] Predict scalar η, derive currents post-hoc. Free divergence-free constraint everywhere — including OOD. Both methods.
  4. [Basis] Informative spectral prior (MIOST mixture) — VSSGP native. For NF, use Fourier-feature first layer with the same p(ω).
  5. [Data — augmentation] Drifter velocities as derivative obs. Cheap, high-physical-content. Both methods.
  6. [Loss] Linearised PV soft constraint at collocation points. VSSGP wins here — closed-form, single solve. NF fallback: standard PINN.
  7. [Loss] Full nonlinear PV. Optional, only after 1–6 plateau. Parity between methods.

SST

  1. [Data — output preproc] Subtract Reynolds OISST climatology.
  2. [Data — input] Periodic harmonics (annual, semiannual, diurnal).
  3. [Loss] Student-T likelihood for cloud-contaminated retrievals.
  4. [Basis] Spectral prior centred on frontal scales (~50 km).
  5. [Loss] Two-stage advection-diffusion with u_g from SSH reconstruction. VSSGP closed-form for the linear PDE.

SSS

Same as SST minus the diurnal harmonic (negligible). Spatially-varying σ_pde near coasts where S is uncertain.

Chl-a

  1. [Data — output param] Log-transform. Predict log f. NF: trivial. VSSGP: breaks Gaussian likelihood — need SVI / Laplace.
  2. [Data — input] Periodic harmonic (annual; bloom seasonality).
  3. [Loss] Two-stage advection-diffusion without source term. Captures bloom-edge propagation.
  4. [Loss] Learned source S_θ only if observed bloom edges are sharper than the reconstruction. NF wins here — putting an NN inside the loss is native.

Out-of-domain experiments

The most important test of the basis-axis informative prior is prediction outside the training footprint. Two recommended OOD splits:

For VSSGP, report the posterior mean and per-point standard deviation — the latter should grow as you move out of domain, with the right spectral content (visible by drawing a few posterior samples and inspecting their power spectrum). For NF, report a deep-ensemble mean ± std (5–10 independently trained networks); SIREN extrapolation is famously unstable[2] and the ensemble spread is the cheapest signal of that instability.
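A sketch of the ensemble summary for the NF side; the members are stubbed here as random sinusoid predictors, whereas a real run would use 5–10 independently trained SIRENs:

```python
import numpy as np

def ensemble_predict(models, x):
    """Deep-ensemble summary: mean and per-point spread over K members."""
    preds = np.stack([m(x) for m in models])          # (K, n_points)
    return preds.mean(axis=0), preds.std(axis=0)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 50)

# Stub members: each lambda captures its own random weights via default args
models = [lambda x, w=rng.normal(size=3): w[0] + w[1] * np.sin(x) + w[2] * np.cos(x)
          for _ in range(8)]
mean, std = ensemble_predict(models, x)
# `std` is the per-point ensemble spread — the cheapest instability signal
```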


6. My take

The constraint-mechanism story for this project is three axes (basis / data / loss), not one, and they have very different leverage profiles for OOD prediction:

Axis ordering by OOD leverage: Data > Basis > Loss.

That ranking is specific to out-of-domain prediction and matters because it inverts the “in-domain” intuition.

This is the part of the picture I had wrong in earlier framings of the doc: I was treating soft physics as the workhorse and basis/data as additions. For an OOD-prediction project the order is the reverse.

Concrete recommendation for siren_vs_rff:

  1. Lock in all the data-axis moves first (preprocessing, periodic embeddings, scalar-latent parameterisation, derivative obs). These are the OOD-leverage moves and they apply identically to NF and VSSGP, so you can run them as the controlled-comparison baseline.
  2. Then ablate the basis-axis informative prior — VSSGP with MIOST init vs. VSSGP with uniform p(ω) vs. SIREN with default ω₀ vs. SIREN with Fourier-feature wrapper using the same p(ω). This is the cleanest test of the 00_siren_rff_vssgp.md spectral-budget claim, and the OOD region is where the differences should be most visible.
  3. Add loss-axis soft physics last, with collocation points on a grid that covers the OOD region as well as the training region — otherwise the constraint won’t help the metric you actually care about.

For the methods themselves: VSSGP wins on basis and on linear-loss; NF wins on Chl-a’s learned source term. Use both, additively, where each is strong. The siren_vs_rff comparison is most informative if it reports (method × axis × OOD-vs-in-domain) cells separately rather than a single headline RMSE — the headline number hides the place where the methods most differ.


Footnotes
  1. Krishnapriyan, A., Gholami, A., Zhe, S., Kirby, R., Mahoney, M. (2021). “Characterizing possible failure modes in physics-informed neural networks.” NeurIPS 34.

  2. Lipman, Y., Aharoni, M. (2021). “On the failure modes of implicit neural representations under extrapolation.” Workshop on Implicit Neural Representations.