
SIREN ↔ RFF ↔ VSSGP — Bayesian Fourier features for geoscience interpolation

A reading of the SIREN / Random Fourier Feature (RFF) / Variational Sparse Spectrum GP (VSSGP) family from a geoscience-interpolation viewpoint. The thesis: SIREN and RFF are special cases of VSSGP with degenerate spectral priors — and for sparse, banded, irregularly-sampled geophysical fields (SSH, SST, SSS, ocean colour) the VSSGP framing is strictly more useful because the spectral structure is something we already know.

This note is split into seven layers, in increasing order of structural sophistication:

  1. §0 — applied-math foundations (Bochner, RFF as Monte Carlo, weight-space ↔ function-space duality, sample complexity, marginal likelihood as Occam’s razor).
  2. §1 — SIREN / RFF as special cases of VSSGP, with the four-rung hierarchy.
  3. §2 — why geoscience is the wrong domain for SIREN, with a spectral-budget argument.
  4. §3 — physical priors for SSH / SST / SSS / OC, including the operational-prior recipe (MIOST / DYMOST / DUACS → VSSGP).
  5. §4 — additive-component construction in VSSGP, with a worked SSH example.
  6. §5 — approximation-error and marginal-likelihood-compass results.
  7. §6 — the experiment plan (datasets, metrics, ablations, compute budget).

Companion: Enforcing physics in NF / VSSGP training — basis, data, loss — how to enforce physical constraints (basis / data / loss) on top of any of these methods, with explicit out-of-domain prediction in mind.


0. Why Any of This Works — Applied Foundations

Before the hierarchy of methods, it’s worth setting down five results that everything else rests on. Each is one line of theory plus one line of “what this buys you in practice.”

0.1 Bochner’s theorem — the bridge

For any continuous, shift-invariant, positive-definite kernel k(x, x') = κ(x − x') normalized so κ(0) = 1:

$$
\kappa(\tau) = \int \exp(i\,\omega^{\top}\tau)\, p(\omega)\, d\omega
$$

i.e. κ and a probability density p(ω) are a Fourier pair. Positive-definiteness of κ is equivalent to non-negativity of p.

Engineering consequence. Picking a kernel and picking a spectral density are the same act. Once you have either, the other is determined by an FFT. This is why you are allowed to design priors directly in the Fourier domain — which is exactly where physical knowledge of geoscience signals lives (annual cycles, mesoscale band, tidal frequencies).

| Kernel | p(ω) | Geoscience use |
|---|---|---|
| Squared exponential | N(0, ℓ⁻²) | smooth fields, no preferred scale |
| Matérn-ν | multivariate Student-t | rough fields with controlled differentiability |
| Cosine | δ(ω − ω₀) | exact harmonic (annual, diurnal) |
| Spectral mixture (Wilson & Adams) | Σᵢ wᵢ N(μᵢ, Σᵢ) | banded structure (mesoscale + gyre + tide) |
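A quick numerical check of the Fourier-pair claim, assuming nothing beyond numpy: discretize a squared-exponential κ(τ), FFT it, and compare against the N(0, ℓ⁻²) density promised by the table. A sketch:

import numpy as np

ell = 2.0                                    # lengthscale
tau = np.linspace(-50, 50, 4097)[:-1]        # symmetric grid, tau = 0 at index N/2
dtau = tau[1] - tau[0]

kappa = np.exp(-0.5 * tau**2 / ell**2)       # SE kernel, kappa(0) = 1

# p(omega) = (1/2pi) * integral exp(-i omega tau) kappa(tau) dtau, via FFT
p_fft = np.real(np.fft.fftshift(np.fft.fft(np.fft.ifftshift(kappa)))) * dtau / (2 * np.pi)
omega = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(tau.size, d=dtau))

# analytic dual: N(0, 1/ell^2) density
p_true = ell / np.sqrt(2 * np.pi) * np.exp(-0.5 * (omega * ell) ** 2)

print(np.max(np.abs(p_fft - p_true)))        # ~1e-12: kappa and p are one FFT apart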

0.2 RFF as Monte Carlo of the Bochner integral

Take the real form of the Bochner identity (κ is real, so we keep only the cosine):

$$
\kappa(x - x') = \mathbb{E}_{\omega \sim p,\; b \sim U[0, 2\pi]}\left[\, 2 \cos(\omega^{\top} x + b)\, \cos(\omega^{\top} x' + b) \,\right]
$$

Pull M samples (ωᵢ, bᵢ). The MC estimator is the inner product of the feature map

$$
\phi(x) = \sqrt{2/M}\,\bigl[\cos(\omega_1^{\top} x + b_1),\, \ldots,\, \cos(\omega_M^{\top} x + b_M)\bigr] \quad\Longrightarrow\quad \hat k(x, x') = \phi(x)^{\top} \phi(x')
$$

Sample complexity. Rahimi & Recht (2008, “Uniform Approximation of Functions with Random Bases”) show that on a compact set X ⊂ ℝ^d of diameter R,

$$
\Pr\left[\, \sup_{x, x' \in X}\, \bigl|\hat k(x, x') - k(x, x')\bigr| \geq \varepsilon \,\right] \;\leq\; 2^8 \left(\frac{\sigma_p R}{\varepsilon}\right)^2 \exp\left(-\frac{M \varepsilon^2}{4(d + 2)}\right)
$$

so to drive uniform error below ε with high probability you need

$$
M = \mathcal{O}\left( \frac{d}{\varepsilon^2}\, \log \frac{\sigma_p R}{\varepsilon} \right)
$$

Engineering consequence. M scales linearly with input dimension and inverse-squared in the kernel error you tolerate. Doubling the precision quadruples M. For SSH on (lon, lat, t) (d = 3) and 1% kernel error you typically need M in the low thousands — well within reach of a JAX dense matmul.
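To make the estimator concrete, a minimal sketch assuming an SE kernel with lengthscale ℓ (so p(ω) = N(0, ℓ⁻² I)): sample (ωᵢ, bᵢ), build φ, and watch the uniform kernel error shrink as M grows.

import jax
import jax.numpy as jnp

def rff_features(X, omega, b):
    # X: (N, d), omega: (M, d), b: (M,)  ->  phi: (N, M)
    return jnp.sqrt(2.0 / omega.shape[0]) * jnp.cos(X @ omega.T + b)

d, ell = 3, 1.0
X = jax.random.uniform(jax.random.PRNGKey(0), (200, d))
K_true = jnp.exp(-0.5 * jnp.sum((X[:, None] - X[None]) ** 2, -1) / ell**2)

for M in (256, 1024, 4096):
    kw, kb = jax.random.split(jax.random.PRNGKey(M))
    omega = jax.random.normal(kw, (M, d)) / ell              # omega ~ N(0, 1/ell^2 I)
    b = jax.random.uniform(kb, (M,), maxval=2 * jnp.pi)
    phi = rff_features(X, omega, b)
    print(M, float(jnp.max(jnp.abs(phi @ phi.T - K_true))))  # error ~ 1/sqrt(M)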

0.3 Weight-space ↔ function-space duality

If you put a Gaussian prior on the weights of the random feature model

$$
f(x) = \phi(x)^{\top} w, \qquad w \sim \mathcal{N}(0, \Sigma_w)
$$

then f is itself a Gaussian process with mean 0 and covariance

$$
\mathrm{Cov}\bigl(f(x), f(x')\bigr) = \phi(x)^{\top} \Sigma_w\, \phi(x')
$$

Setting Σ_w = σ_f² I recovers exactly the RFF kernel scaled by σ_f². So the two views are not analogies — they are the same model, written in different bases.

| View | Object you optimize | Cost dominated by |
|---|---|---|
| Function-space | posterior mean μ(x*) = k_*ᵀ (K + σ_n²I)⁻¹ y | O(N³) for (K + σ_n²I)⁻¹ |
| Weight-space (RFF) | posterior mean μ(x*) = φ(x*)ᵀ E[w ∣ y] | O(N M² + M³) for Σ_post |

Engineering consequence. When M ≪ N (the geoscience regime — N ~ 10⁶, M ~ 10³), weight-space is asymptotically (N/M)² cheaper. The whole reason SSGP/VSSGP exists as a practical method, not just a theoretical bridge, is that the duality flips the cost direction.

0.4 Posterior in closed form (the bit that SIREN throws away)

For RFF/SSGP with Gaussian likelihood y = Φ w + ε, ε ~ N(0, σ_n²I):

$$
\begin{aligned}
\Sigma_{\text{post}} &= \bigl(\Phi^{\top} \Phi / \sigma_n^2 + \Sigma_w^{-1}\bigr)^{-1} && (M \times M) \\
\mu_{\text{post}} &= \Sigma_{\text{post}}\, \Phi^{\top} y / \sigma_n^2 && (M,) \\
\mathbb{E}[f^*] &= \phi(x^*)^{\top} \mu_{\text{post}} \\
\mathrm{Var}[f^*] &= \phi(x^*)^{\top} \Sigma_{\text{post}}\, \phi(x^*) \\
\mathrm{Var}[y^*] &= \mathrm{Var}[f^*] + \sigma_n^2
\end{aligned}
$$

(Var[f*] is the latent function variance — what you use for CRPS / NLPD on the underlying field; Var[y*] adds the observation-noise term and is what you use when comparing predictions to noisy held-out measurements.)

Engineering consequence. This is one Cholesky on an M × M matrix. You get not just a point prediction, but a full predictive distribution at every test point — gap-filling uncertainty for free. SIREN replaces this entire block with argmin_w ‖y − Φw‖² + λ‖w‖² (i.e. ridge regression with a single tuned scalar λ), losing the per-point variance and the principled regularization-from-the-prior.
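The block is a few lines of JAX. A sketch, assuming Σ_w = σ_f² I, with Φ the (N, M) feature matrix:

import jax.numpy as jnp
from jax.scipy.linalg import cho_factor, cho_solve

def ssgp_posterior(Phi, y, sigma_f, sigma_n):
    # one M x M Cholesky; returns the posterior N(mu_post, Sigma_post) over w
    M = Phi.shape[1]
    A = Phi.T @ Phi / sigma_n**2 + jnp.eye(M) / sigma_f**2   # Sigma_post^-1
    cho = cho_factor(A)
    mu_post = cho_solve(cho, Phi.T @ y) / sigma_n**2
    return mu_post, cho

def ssgp_predict(phi_star, mu_post, cho, sigma_n):
    # phi_star: (N*, M) features at the test points
    mean = phi_star @ mu_post                                         # E[f*]
    var_f = jnp.sum(phi_star * cho_solve(cho, phi_star.T).T, axis=1)  # Var[f*]
    return mean, var_f, var_f + sigma_n**2                            # ..., Var[y*]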

0.5 Marginal likelihood as automatic Occam’s razor

The model evidence (with Σ_w = σ_f² I)

$$
\log p(y \mid \Omega, \sigma_f, \sigma_n) = -\tfrac{1}{2}\, y^{\top}\bigl(\sigma_f^2\, \Phi \Phi^{\top} + \sigma_n^2 I\bigr)^{-1} y \;-\; \tfrac{1}{2}\, \log\bigl|\sigma_f^2\, \Phi \Phi^{\top} + \sigma_n^2 I\bigr| \;-\; \tfrac{N}{2}\, \log 2\pi
$$

decomposes into data-fit (first term) plus complexity penalty (log-determinant). Optimizing this w.r.t. Ω and amplitudes pulls spectral mass onto frequencies that explain y — but pays a log|·| cost for adding mass at frequencies that don’t help. There is no equivalent self-regularizing objective in MSE training of SIREN; ω₀ has to be tuned by hand.

Engineering consequence. The marginal likelihood gradient w.r.t. spectral points is exactly the signal that says “your spectral prior is wrong, move mass here”. This is what SSGP exploits; VSSGP softens point-optimized Ω to a variational posterior q(Ω) to prevent the well-known SSGP overfitting pathology (Lázaro-Gredilla et al. 2010, §5).
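Via the Woodbury identity and the matrix determinant lemma, the evidence can be computed without ever forming the N × N matrix. A sketch, same conventions as in §0.4:

import jax.numpy as jnp
from jax.scipy.linalg import cho_factor, cho_solve

def log_evidence(Phi, y, sigma_f, sigma_n):
    N, M = Phi.shape
    A = Phi.T @ Phi / sigma_n**2 + jnp.eye(M) / sigma_f**2   # (M, M)
    cho = cho_factor(A)
    Phiy = Phi.T @ y
    # Woodbury: y^T (sigma_f^2 Phi Phi^T + sigma_n^2 I)^-1 y in O(N M^2)
    quad = (y @ y - Phiy @ cho_solve(cho, Phiy) / sigma_n**2) / sigma_n**2
    # determinant lemma: log|sigma_f^2 Phi Phi^T + sigma_n^2 I|
    logdet = (2 * jnp.sum(jnp.log(jnp.diag(cho[0])))
              + 2 * M * jnp.log(sigma_f) + 2 * N * jnp.log(sigma_n))
    return -0.5 * (quad + logdet + N * jnp.log(2 * jnp.pi))

jax.grad of this with respect to the spectral points (through Φ) is exactly the “spectral compass” gradient discussed in §5.2.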


1. SIREN / RFF as Special Cases of VSSGP

Core claim: The SIREN and Random Fourier Features (RFF) methods from the neural networks community are special cases of Variational Sparse Spectrum GPs (VSSGP) and Random Feature Expansions (RFE) from the GP community — with uninformative/degenerate priors and no posterior.

Correspondence Table

| Neural field | GP counterpart | What’s missing in the NN version |
|---|---|---|
| RFF (Rahimi & Recht) | Sparse Spectrum GP (fixed ω, no Bayes) | no posterior, just a kernel approximation |
| SIREN (1-layer) | VSSGP with specific spectral prior p(ω) | Bayesian treatment of weights + frequencies |
| Deep SIREN | deep GP with sinusoidal RFEs | uncertainty propagation across layers |
| ω₀ + init heuristic | kernel lengthscale / spectral density | principled derivation via Bochner’s theorem |

RFF → Sparse Spectrum GP

Rahimi & Recht’s RFF is an MC approximation to a shift-invariant kernel via Bochner’s theorem:

$$
k(x, y) \;\approx\; \phi(x)^{\top}\phi(y), \qquad \phi(x) = \sqrt{2/D}\,\bigl[\cos(\omega_i^{\top} x + b_i)\bigr]_{i=1}^{D}, \qquad \omega_i \sim p_{\text{spectral}}(\omega) = \mathcal{F}[k](\omega)
$$

The Sparse Spectrum GP (Lázaro-Gredilla et al. 2010) does the same but optimizes the ωᵢ via marginal likelihood. VSSGP puts a full variational distribution q(ω) over the spectral points.

SIREN → VSSGP with Box Spectral Prior

A 1-layer SIREN:

$$
f(x) = W_2\, \sin\bigl(\omega_0\, W_1 x + b_1\bigr)
$$

with W₁ ~ Uniform(-√(6/n), √(6/n)) is implicitly assuming a spectral prior:

$$
p(\omega) \propto \mathrm{Uniform}\bigl[-\omega_0 \sqrt{6/n},\; \omega_0 \sqrt{6/n}\bigr]^d
$$

This is a box spectral prior — corresponding to a product sinc kernel. VSSGP would let you:

  1. Infer the spectral support instead of heuristically choosing ω₀
  2. Place a proper Gaussian prior on W₂ → closed-form posterior
  3. Optimize the ELBO rather than MSE → built-in regularization

The “Better Constrained” Argument

| Issue | SIREN | VSSGP |
|---|---|---|
| ω₀ tuning | heuristic, fragile | marginal likelihood gradient |
| Weight prior | implicit (optimizer trajectory) | explicit N(0, σ²I) |
| Function space | NTK (implicit) | named RKHS |
| Uncertainty | none | posterior q(w) |

SIREN is VSSGP with a degenerate (improper flat) weight prior and frequencies fixed by initialization rather than optimized/marginalized.

The Four-Rung Hierarchy

All four methods share the same forward model f(x) = φ(x)ᵀ w with φ built from Ω. They differ only in what is treated as fixed, point-estimated, or distributional:

| Method | Ω (frequencies) | w (weights) | Objective | Cost per train step |
|---|---|---|---|---|
| RFF (Rahimi-Recht) | sampled once from p(ω), fixed | ridge regression, closed form | ‖y − Φw‖² + λ‖w‖² | O(N M² + M³) once |
| SSGP (Lázaro-Gredilla 2010) | point-optimized via marginal likelihood | closed-form posterior N(μ, Σ) | log p(y ∣ Ω, θ) | O(N M² + M³) per gradient step |
| VSSGP (Gal & Turner 2015) | variational q(Ω) = N(μ_Ω, σ_Ω²) with prior p(ω) | variational q(w) (or marginalized) | ELBO | O(N M² + M³) per ELBO step |
| SIREN (Sitzmann 2020) | fixed by init, scaled by ω₀ | point-optimized via SGD | ‖y − Φw‖² (no prior) | O(N M) per SGD step (depth-multiplied) |

The progression is decreasing assumptions, increasing cost — but for sparse geoscience data, the cost is dominated by M, not by which rung you’re on, so going to the top rung is essentially free relative to the compute you’ve already committed.

What SIREN’s Init Heuristic Actually Sets

In a 1-layer SIREN, W₁ ~ U(−√(6/n), √(6/n)) and the activation is sin(ω₀ W₁ x). The effective spectral measure is

$$
\omega_{\text{eff}} = \omega_0\, W_1 \quad\Longrightarrow\quad p(\omega) = U\bigl[\,-\omega_0 \sqrt{6/n},\; +\omega_0 \sqrt{6/n}\,\bigr]^d
$$

— i.e. a d-dimensional uniform box. The Fourier dual of a box is a sinc, so SIREN’s implicit prior kernel is a product of sinc functions. This kernel has the worst-of-both-worlds spectral signature: it puts non-trivial mass on every frequency up to ω₀ (no preference for known scales) and zero mass beyond ω₀ (cannot represent anything finer-scale than the chosen cutoff). For a geoscience signal where the energy lies in a known narrow band, this is exactly inverted from what you want.
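The sinc claim is easy to verify by Monte Carlo in 1-D. A standalone sketch under the stated box prior (not SIREN code):

import numpy as np

omega_max = 10.0                              # box half-width omega_0 * sqrt(6/n)
tau = np.linspace(-3, 3, 301)

rng = np.random.default_rng(0)
w = rng.uniform(-omega_max, omega_max, size=100_000)
b = rng.uniform(0, 2 * np.pi, size=w.size)

# implied kernel: kappa(tau) = E[2 cos(w tau + b) cos(b)]
kappa_mc = np.array([np.mean(2 * np.cos(w * t + b) * np.cos(b)) for t in tau])

# Fourier dual of the box: sin(omega_max tau) / (omega_max tau)
kappa_sinc = np.sinc(omega_max * tau / np.pi)

print(np.max(np.abs(kappa_mc - kappa_sinc)))  # ~3e-3, the MC noise floor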


2. Why Geoscience Data is the Wrong Domain for SIREN

Spectral Budget — How Many of Your M Features Land Where the Signal Is?

Suppose the signal’s true spectral support is a set B ⊂ ℝᵈ (e.g. for SSH: a band around |ω| ∈ [2π/300km, 2π/50km], plus a delta at the annual frequency). Define the useful fraction

$$
\eta(p) = \int_B p(\omega)\, d\omega
$$

i.e. the probability that a single feature lands in the band where signal lives. Then by linearity of expectation, the expected number of useful features out of M is M_useful = M · η(p).

For SIREN’s box prior in d = 3 (lon, lat, t), with ω₀ chosen large enough to cover annual + mesoscale (ω_max ≈ 2π/50km), the useful fraction is roughly

$$
\eta_{\text{SIREN}} \;\approx\; \frac{\mathrm{vol}(B)}{\mathrm{vol}\bigl([-\omega_{\max}, \omega_{\max}]^3\bigr)} \;\approx\; 10^{-3}\text{–}10^{-2}
$$

For a spectral-mixture prior centered on the band, η_VSSGP → 1 by construction. The implication: with M = 1024 features, SIREN gets ~10 features in the right place; VSSGP gets ~1024. To match VSSGP’s effective resolution, SIREN needs M_SIREN ≈ M_VSSGP / η, i.e. two orders of magnitude wider at this d and band geometry. Capacity isn’t free — wider networks cost more flops and overfit harder on sparse data.

This is the cleanest engineering argument: the cost of an uninformative prior is paid in spent feature budget, and that budget isn’t recoverable by training longer.
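η itself is one Monte Carlo integral. A sketch with illustrative numbers (band edges from the SSH example above; the box cutoff carries an assumed 3× safety margin over the finest scale, so the exact η_box depends on that choice):

import numpy as np

rng = np.random.default_rng(0)
M = 200_000

# band B: spatial wavenumber in [2pi/300, 2pi/50] km^-1, |w_t| <= 2pi/20 day^-1
k_lo, k_hi = 2 * np.pi / 300, 2 * np.pi / 50
wt_hi = 2 * np.pi / 20

def useful_fraction(w_spatial, w_t):
    k = np.linalg.norm(w_spatial, axis=1)
    return np.mean((k >= k_lo) & (k <= k_hi) & (np.abs(w_t) <= wt_hi))

# SIREN-style box, cutoff 3x above the band on every axis (assumption)
w_box = rng.uniform(-3 * k_hi, 3 * k_hi, size=(M, 2))
t_box = rng.uniform(-3 * wt_hi, 3 * wt_hi, size=M)
print("eta_box ~", useful_fraction(w_box, t_box))     # a few 1e-2

# mixture prior centred on the mesoscale band
k_c = 2 * np.pi / 150
direction = rng.uniform(0, 2 * np.pi, M)
magnitude = np.abs(rng.normal(k_c, 0.3 * k_c, M))
w_mix = magnitude[:, None] * np.column_stack([np.cos(direction), np.sin(direction)])
t_mix = rng.normal(0.0, wt_hi / 2, size=M)
print("eta_mix ~", useful_fraction(w_mix, t_mix))     # ~0.9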

The Spectral Structure is Known A Priori

| Field | Known spectral features |
|---|---|
| Ocean SSH | mesoscale eddy band ~100–300 km, internal tides, inertial oscillations |
| Atmosphere | diurnal/semidiurnal harmonics, synoptic scales, planetary waves |
| Land surface temp | annual + semiannual cycles, diurnal, known spatial correlation lengths |
| Soil moisture | seasonal + storm-scale, exponential spatial decay |
| Methane | seasonal biogenic signal, specific plume spatial scales |

SIREN’s ω₀ is a wild guess at this. VSSGP lets you encode it directly:

# Informative spectral prior for ocean SSH
ω ~ MixtureOfGaussians(
    μ₁ = 2π / 200km,   # mesoscale eddy peak
    μ₂ = 2π / 500km,   # large-scale gyre
)

# vs SIREN's implicit prior:
ω ~ Uniform[-ω₀√(6/n), ω₀√(6/n)]   # knows nothing

Geoscience Data is Sparse and Irregular

SIREN was designed for dense, regular signals (images, SDFs). Geoscience data is almost never this: along-track altimetry samples 1-D lines through a 3-D (lon, lat, t) volume; IR SST arrives in cloud-masked swaths; in-situ platforms (tide gauges, Argo, drifters) are sparse points clustered near coasts.

SIREN has no principled way to handle this. VSSGP gives a posterior predictive p(f* | X*, X, y) — proper uncertainty in gaps.

Physical Constraints Map to the GP Framework Naturally

Linear constraints (geostrophy, tidal harmonics, advection by a background flow) act on a GP as linear operators on the kernel; the companion note on enforcing physics (basis / data / loss) covers the mechanics.

SIREN Convergence is Structurally Fragile

The initialization solves two coupled problems with a single heuristic (ω₀):

  1. Keep activation distributions stable through layers
  2. Cover the target frequency range

In VSSGP this doesn’t exist as a problem — spectral points are initialized from your physical prior and optimized via marginal likelihood.


3. Physical Priors for Ocean Variables

3.0 Operational SSH mapping schemes ARE fitted spectral priors

Three operational schemes — DUACS, MIOST/MASSH, and DYMOST — have spent the last 30 years fitting prior covariances to global altimetry. Their tuned hyperparameters drop into VSSGP with almost no translation work. Treating them as “the spectral prior we’d otherwise have to learn from scratch” is the single biggest free win available for SSH.

3.0.1 MIOST → spectral-mixture prior with Gabor components

MIOST (Multi-scale Inversion of Ocean Surface Topography, Ardhuin et al. 2021; Le Guillou et al. 2021) decomposes SSH into a sum of Gabor wavelets — plane waves modulated by Gaspari-Cohn windows — at three to four scales:

| MIOST component | Spatial scale L_s | Temporal scale τ | Variance share |
|---|---|---|---|
| Large-scale (gyre) | 800–1500 km | 60–120 days | ~30–40% |
| Mesoscale | 150–300 km | 20–40 days | ~40–50% |
| Sub-mesoscale (SWOT-era) | 30–80 km | 3–10 days | ~10–20% |
| Equatorial / waveguide | 200–500 km × 50–100 km (anisotropic) | 30–60 days | regional only |

A Gabor wavelet g(x) = exp(i ω₀ᵀx) · w(x − x₀) with window w of width L_s has Fourier transform ĝ(ω) = ŵ(ω − ω₀) — i.e. a Gaussian-like bump centred on ω₀ = 2π/L_s with bandwidth ~1/L_s. So MIOST’s wavelet dictionary is, in spectral terms, a sum of Gaussians:

$$
p_{\text{MIOST}}(\omega) = \sum_n \frac{\mathrm{var}_n}{\sum_k \mathrm{var}_k}\; \mathcal{N}\bigl(\omega;\; 2\pi/L_{s,n},\; \sigma_n^2\bigr), \qquad \sigma_n \approx 0.3 \cdot (2\pi/L_{s,n})
$$

(bandwidth ~30% of the centre frequency).

This is literally a spectral mixture prior à la Wilson & Adams (2013), with the centres and weights pre-fitted by the altimetry community.

Engineering consequence. In VSSGP you don’t need to discover that the energy lives in a gyre + mesoscale + sub-mesoscale band — you initialize μ_Ω at the MIOST centres, allocate M_n features per component proportionally to the variance share, and let the marginal-likelihood gradient refine the centres locally instead of finding them globally. This converts a non-convex spectral search into local refinement.
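A sketch of that initialization, using the table’s numbers (wavenumber magnitudes only; pairing each with a random 2-D direction and a per-component σ_Ω completes the variational init):

import jax
import jax.numpy as jnp

M = 1024
L_scales  = jnp.array([1000.0, 200.0, 50.0])       # km: gyre, meso, sub-meso
var_share = jnp.array([0.35, 0.45, 0.20])

centres   = 2 * jnp.pi / L_scales                   # rad / km
bandwidth = 0.3 * centres                           # ~30% of centre frequency
M_n = jnp.round(var_share * M).astype(int)          # features per component

keys = jax.random.split(jax.random.PRNGKey(0), 3)
mu_Omega = jnp.concatenate([
    c + s * jax.random.normal(k, (int(m),))         # q(Omega) means at MIOST centres
    for c, s, m, k in zip(centres, bandwidth, M_n, keys)
])                                                  # (1024,)

The marginal-likelihood gradient then only has to nudge each μ_Ω within its component, not move mass across components.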

3.0.2 DYMOST → anisotropic, flow-aligned spectral measure

DYMOST (Ubelmann et al. 2015, 2020) advects the SSH covariance along a Lagrangian first-guess velocity field u_g, derived from a 1.5-layer QG model run on the prior mean. The effective covariance has two properties that a stationary p(ω) cannot capture:

  1. Anisotropy: along-stream lengthscale L_∥ ≈ 2–4 × L_⊥ in strong currents (Gulf Stream, Kuroshio, ACC).
  2. Spatial dependence: L_∥(x), L_⊥(x) track local eddy kinetic energy.

The spectral dual of along-stream stretching is wavenumber compression in the perpendicular direction:

$$
\Sigma_\omega(x) = R\bigl(\theta_{\text{flow}}(x)\bigr)^{\top}\, \mathrm{diag}\bigl(\,(2\pi/L_\perp)^2,\; (2\pi/L_\parallel)^2\,\bigr)\, R\bigl(\theta_{\text{flow}}(x)\bigr)
$$

(wavenumber covariance squashed perpendicular to the flow direction).

In a VSSGP this becomes a location-conditioned spectral covariance: instead of a single global Σ_ω, sample ωᵢ from N(0, Σ_ω(x_patch)) per patch, where θ_flow and L_⊥/L_∥ come from a low-pass-filtered first-guess (or a climatology like AVISO MDT).

Engineering consequence. This pairs naturally with the patch-decomposition recipe in 03_global_scaling_patches. Each patch gets its own Σ_ω(x_patch) from the DYMOST first-guess; the global VSSGP becomes a mixture of patch-local anisotropic VSSGPs, stitched with Gaspari-Cohn weights.

3.0.3 DUACS → empirical variance and lengthscale fields

DUACS (the operational AVISO/CMEMS product, Pujol et al. 2016; Taburet et al. 2019) fits per-pixel variance σ²(x) and isotropic lengthscale L(x) from a 30-year SLA reanalysis. These are public data products — cmems_obs-sl_glo_phy-mdt_my_allsat-l4-duacs_P1Y and the auxiliary lengthscale grids.

Two direct uses in VSSGP:

| DUACS field | VSSGP role | Where it goes |
|---|---|---|
| σ²(x) (SLA variance) | spatially varying weight-prior amplitude | σ_w(x) = √σ²(x) per patch |
| L(x) (correlation lengthscale) | spatial scaling of Σ_ω | Σ_ω(x) ∝ (2π/L(x))² I |
| θ_flow(x), L_∥/L_⊥(x) (DYMOST aux) | anisotropy of Σ_ω | rotation + axis ratio in §3.0.2 |

Engineering consequence. Three operationally-tuned fields — variance, isotropic lengthscale, anisotropy — give you a fully location-conditioned VSSGP prior with no free hyperparameters left to tune. The only thing the marginal-likelihood gradient still needs to do is local refinement around the operational defaults; everything global has been done by 30 years of community fitting.

3.0.4 Practical integration recipe

The handoff is short:

import jax.numpy as jnp

# 1. Load operational priors (one-time, per region); loaders are illustrative stubs
σ_field   = load_duacs_variance(region)         # (H, W)
L_field   = load_duacs_lengthscale(region)      # (H, W)
θ_field   = load_dymost_flow_angle(region)      # (H, W)
ratio     = load_dymost_axis_ratio(region)      # (H, W) ; L_par / L_perp

# 2. Per patch (see 03_global_scaling_patches), build patch-local prior
def patch_prior(x_patch):
    σ_patch  = σ_field.interp(x_patch)
    L_patch  = L_field.interp(x_patch)
    θ_patch  = θ_field.interp(x_patch)
    r_patch  = ratio.interp(x_patch)

    # MIOST-style mixture, scaled to local lengthscale
    centres = jnp.array([
        2*jnp.pi / (L_patch * 5.0),    # gyre   (5x lengthscale)
        2*jnp.pi / (L_patch * 1.0),    # meso   (1x lengthscale)
        2*jnp.pi / (L_patch * 0.25),   # sub    (0.25x lengthscale)
    ])
    weights = jnp.array([0.35, 0.45, 0.20])  # variance shares

    # DYMOST-style anisotropy
    # Spectral covariance scales like (2π/L)², and r_patch = L_∥/L_⊥ ≥ 1, so
    # the parallel axis carries 1/r² *less* wavenumber variance than the
    # perpendicular axis (the field varies more rapidly across-stream).
    R = rotation(θ_patch)                     # 2×2 rotation matrix from the flow angle
    Σ_ω = R.T @ jnp.diag(jnp.array([1.0, 1.0 / r_patch**2])) @ R   # [⊥, ∥]

    return centres, weights, Σ_ω, σ_patch

# 3. Initialize VSSGP variational params μ_Ω, σ_Ω from this prior
#    Train (refines locally; doesn't search globally)

The bookkeeping is the same as a generic VSSGP — just the initialization and prior change. No new code paths.

3.0.5 What the operational schemes don’t give you (and VSSGP does)

The summary: MIOST and DYMOST tell you where to put the spectral mass; DUACS tells you how it varies in space; VSSGP tells you the posterior. The three together are strictly more than any one alone.

SSH (Sea Surface Height)

# Spatial: power-law spectral density
p(ω) ∝ |ω|^{-α},   α ≈ 4–5   (mesoscale range)

# Mixture:
p(ω) = w₁ · N(2π/L_gyre, σ₁)      # L_gyre ~ 1000 km
      + w₂ · N(2π/L_meso, σ₂)      # L_meso ~ 100–300 km
      + w₃ · N(2π/L_tide, σ₃)      # L_tide ~ 100 km

# Temporal:
p(ω_t) = w₁ · δ(ω_t − 2π/365.25)    # annual
        + w₂ · δ(ω_t − 2π/182.6)     # semiannual
        + w₃ · N(0, σ_meso)          # mesoscale (weeks–months)

# Weight prior (anomaly around mean dynamic topography)
w ~ N(0, σ_w²),    σ_w ~ 0.1–0.3 m

# Likelihood
y | f ~ N(f(x), σ_n²),    σ_n ~ 0.02–0.05 m
# Note: along-track altimetry has correlated noise → correlated noise kernel

SST (Sea Surface Temperature)

# Temporal — seasonal dominates
p(ω_t) = w₁ · N(2π/365.25, σ₁)    # annual (dominant)
        + w₂ · N(2π/182.6, σ₂)     # semiannual
        + w₃ · N(2π/1, σ₃)         # diurnal (~0.5–1°C)
        + w₄ · N(0, σ_slow)         # interannual (ENSO)

# Weight prior (as anomaly from climatology)
w ~ N(0, σ_w²),    σ_w ~ 2–5°C mid-latitude, ~1–2°C tropics

# Likelihood — Student-T for cloud-contaminated IR retrievals
y | f ~ StudentT(f(x), σ², ν),    ν ~ 4–7
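In numpyro the robust-likelihood swap is a one-liner. A fragment, assuming f and y as in the ssgp_model sketched in §4 (only the observation lines change):

import numpyro
import numpyro.distributions as dist

# replace the Gaussian observation with a heavy-tailed Student-T:
ν   = numpyro.sample("ν", dist.Uniform(2.0, 10.0))      # tail weight; learn or fix ~5
σ_n = numpyro.sample("σ_n", dist.HalfNormal(0.3))
numpyro.sample("y", dist.StudentT(ν, f, σ_n), obs=y)    # cloud outliers downweighted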

SSS (Sea Surface Salinity)

# Temporal
p(ω_t) = w₁ · N(2π/365.25, σ₁)    # seasonal (precipitation cycle)
        + w₂ · N(0, σ_slow)         # interannual

# Weight prior — non-stationary near rivers
w ~ N(0, σ_w²),    σ_w ~ 0.5–1.5 PSU open ocean
                        ~ 3–5 PSU near river mouths

# Likelihood — SMOS/SMAP have RFI contamination
y | f ~ StudentT(f(x), σ², ν),    ν ~ 3–5
σ_n ~ 0.2–0.5 PSU per retrieval

OC / Chlorophyll-a

Key prior: work in log space. Chl-a is log-normally distributed across the full range (0.01 to >100 mg/m³).

# Model log(Chl-a), not Chl-a directly
g(x) = log(Chl-a(x))
g ~ VSSGP(0, k)

# Temporal
p(ω_t) = w₁ · N(2π/365.25, σ₁)    # spring bloom — very strong
        + w₂ · N(2π/182.6, σ₂)     # secondary bloom
        + w₃ · N(0, σ_slow)         # interannual

# Weight prior (in log space)
w ~ N(0, σ_w²),    σ_w ~ 1.0–1.5 log₁₀ units

# Likelihood — log-normal is canonical
log(y) | f ~ N(f(x), σ_n²),    σ_n ~ 0.1–0.3 log₁₀ units

Summary Table

| Variable | Transform | Spectral prior | Weight σ | Likelihood |
|---|---|---|---|---|
| SSH | none (anomaly) | power-law + mixture at eddy/tide scales | 0.1–0.3 m | Gaussian (or correlated along-track) |
| SST | none (anomaly) | strong annual + diurnal harmonics | 2–5 °C | Student-T (ν ~ 5) |
| SSS | none (anomaly) | seasonal + non-stationary σ near rivers | 0.5–5 PSU | Student-T (ν ~ 3–4) |
| OC / Chl-a | log | annual bloom + patchy spatial | 1.0–1.5 log units | log-normal |

4. How Additive Components Are Injected into VSSGP

The Core SSGP Equation

k(x, x') ≈ φ(x)ᵀ φ(x')

φ(x) = √(2/M) cos(Ω x + b)

where:
  x  : (D,)       input coordinates
  Ω  : (M, D)     spectral points, each row ωᵢ ~ p(ω)
  b  : (M,)       phases, bᵢ ~ Uniform(0, 2π)
  φ  : (M,)       feature vector for one input

Batched over N inputs:
  X  : (N, D)
  φ(X) = √(2/M) cos(X @ Ωᵀ + b)   : (N, M)

Model is Bayesian linear regression in feature space:

w   : (M,)        w ~ N(0, σ_w² I)
f   : (N,)        f = φ(X) @ w   =  (N, M) @ (M,) → (N,)
y   : (N,)        y ~ N(f, σ_n² I)

Additive Kernel = Concatenated Feature Maps

φ(X) = [ φ₁(X) | φ₂(X) | φ₃(X) ]     # (N, M1+M2+M3)
w    = [ w₁    | w₂    | w₃    ]       # (M1+M2+M3,)

f = φ(X) @ w
  = φ₁(X) @ w₁  +  φ₂(X) @ w₂  +  φ₃(X) @ w₃   # (N,) each

Full SSH Example with D=3 inputs (lon, lat, t)

import jax.numpy as jnp
import jax.random as jr
import numpyro
import numpyro.distributions as dist

# hyperparameters (physical prior knowledge)
L_gyre   = 1000.0   # km
L_meso   = 200.0    # km
T_annual = 365.25   # days
T_meso   = 60.0     # days
σ_slow   = 1/500.0  # cycles/day

M1, M2, M3 = 32, 64, 128   # slow | annual×spatial | mesoscale
D = 3                        # (lon, lat, t)


# ================================================================
# Step 1: Sample spectral points Ωⱼ for each component
# ================================================================

def sample_spectral_points(key):

    k1, k2, k3, k4 = jr.split(key, 4)

    # Component 1: slow temporal trend
    # prior: ω_t ~ N(0, σ_slow²),  ω_lon = ω_lat = 0
    ω_t_slow = jr.normal(k1, (M1, 1)) * σ_slow    # (M1, 1)
    ω_zeros  = jnp.zeros((M1, 2))                  # (M1, 2)
    Ω1       = jnp.concatenate(                     # (M1, 3)
                   [ω_zeros, ω_t_slow], axis=-1)
    # Ω1[i] = (0, 0, ω_t_i)  → pure temporal features

    # Component 2: annual cycle × large-scale spatial
    # prior: ω_t ~ N(2π/T_annual, σ_annual²)   ← centred on annual freq
    #        ω_lon, ω_lat ~ N(0, (2π/L_gyre)²)
    σ_annual = 2*jnp.pi / T_annual * 0.1           # 10% bandwidth
    σ_gyre   = 2*jnp.pi / L_gyre

    ω_t_ann  = jr.normal(k2, (M2, 1)) * σ_annual \
               + 2*jnp.pi/T_annual                  # (M2, 1)
    ω_s_ann  = jr.normal(k3, (M2, 2)) * σ_gyre     # (M2, 2)
    Ω2       = jnp.concatenate(                     # (M2, 3)
                   [ω_s_ann, ω_t_ann], axis=-1)
    # Ω2[i] = (ω_lon_i, ω_lat_i, ω_t_i) with ω_t near annual freq

    # Component 3: mesoscale spatiotemporal
    # prior: ω ~ N(0, diag(σ_meso_s², σ_meso_s², σ_meso_t²))
    σ_meso_s = 2*jnp.pi / L_meso                   # mesoscale wavenumber
    σ_meso_t = 2*jnp.pi / T_meso                   # mesoscale frequency
    scales   = jnp.array([σ_meso_s, σ_meso_s, σ_meso_t])  # (3,)
    Ω3       = jr.normal(k4, (M3, D)) * scales      # (M3, 3)
    # Ω3[i] = (ω_lon_i, ω_lat_i, ω_t_i) fully coupled

    return Ω1, Ω2, Ω3
    # shapes: (M1,3), (M2,3), (M3,3)


# ================================================================
# Step 2: Build feature maps for each component
# ================================================================

def make_features(X, Ω1, Ω2, Ω3):
    """
    X    : (N, 3)   input coords (lon, lat, t)
    Ωj   : (Mj, 3)  spectral points for component j
    """
    key_b = jr.PRNGKey(42)
    kb1, kb2, kb3 = jr.split(key_b, 3)

    b1 = jr.uniform(kb1, (M1,), minval=0, maxval=2*jnp.pi)  # (M1,)
    b2 = jr.uniform(kb2, (M2,), minval=0, maxval=2*jnp.pi)  # (M2,)
    b3 = jr.uniform(kb3, (M3,), minval=0, maxval=2*jnp.pi)  # (M3,)

    # X @ Ωⱼᵀ : (N,3) @ (3,Mj) → (N, Mj)
    φ1 = jnp.sqrt(2/M1) * jnp.cos(X @ Ω1.T + b1)   # (N, M1)
    φ2 = jnp.sqrt(2/M2) * jnp.cos(X @ Ω2.T + b2)   # (N, M2)
    φ3 = jnp.sqrt(2/M3) * jnp.cos(X @ Ω3.T + b3)   # (N, M3)

    φ  = jnp.concatenate([φ1, φ2, φ3], axis=-1)     # (N, M1+M2+M3)
    return φ, (φ1, φ2, φ3)


# ================================================================
# Step 3: Bayesian linear model (the SSGP model)
# ================================================================

def ssgp_model(X, y, Ω1, Ω2, Ω3):
    """
    X  : (N, 3)
    y  : (N,)
    """
    M = M1 + M2 + M3   # 224

    φ, _ = make_features(X, Ω1, Ω2, Ω3)   # (N, M)

    # weight prior — amplitude scale per component
    σ_w = jnp.concatenate([
        jnp.full(M1, 0.20),    # (M1,)  slow trend
        jnp.full(M2, 0.15),    # (M2,)  annual cycle
        jnp.full(M3, 0.10),    # (M3,)  mesoscale
    ])                          # (M,)

    w   = numpyro.sample("w", dist.Normal(jnp.zeros(M), σ_w))   # (M,)
    f   = φ @ w                                                   # (N,)

    σ_n = numpyro.sample("σ_n", dist.HalfNormal(0.03))
    numpyro.sample("y", dist.Normal(f, σ_n), obs=y)


# ================================================================
# VSSGP extension: spectral points become variational parameters
# ================================================================

# Replace fixed Ω3 with variational distribution q(ω) = N(μ_ω, diag(σ_ω²))
# μ_Ω3 = numpyro.param("μ_Ω3", Ω3_init)           # (M3, 3)
# σ_Ω3 = numpyro.param("σ_Ω3", 0.1*jnp.ones(...), constraint=positive)
# Ω3   = numpyro.sample("Ω3", dist.Normal(μ_Ω3, σ_Ω3))  # (M3, 3)
# → spectral points move to explain data, regularized toward physical prior

Shape Flow Summary

X       : (N, 3)          ← lon, lat, t for N observations

Ω1      : (M1, 3)         ← spectral pts, slow component
Ω2      : (M2, 3)         ← spectral pts, annual component
Ω3      : (M3, 3)         ← spectral pts, mesoscale component

φ1      : (N, M1)         ← features, slow
φ2      : (N, M2)         ← features, annual
φ3      : (N, M3)         ← features, mesoscale

φ       : (N, M1+M2+M3)   ← concatenated features

w       : (M1+M2+M3,)     ← weights,  w ~ N(0, diag(σ_w²))

f       : (N,)            ← f = φ @ w
y       : (N,)            ← y ~ N(f, σ_n²)

Physical knowledge lives entirely in how Ωⱼ are constructed — their initialization, prior distribution, and (in VSSGP) constraints on how far they move. Everything downstream is linear algebra.
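Because everything downstream is linear algebra, the numpyro model above can be bypassed when the likelihood is Gaussian: the §0.4 closed form applies directly to the concatenated features, with the per-component σ_w entering as a diagonal weight prior. A sketch, reusing make_features and σ_w from above:

import jax.numpy as jnp
from jax.scipy.linalg import cho_factor, cho_solve

def fit_closed_form(X, y, Ω1, Ω2, Ω3, σ_w, σ_n=0.03):
    φ, _ = make_features(X, Ω1, Ω2, Ω3)                  # (N, M)
    A = φ.T @ φ / σ_n**2 + jnp.diag(1.0 / σ_w**2)        # (M, M) posterior precision
    cho = cho_factor(A)
    μ_post = cho_solve(cho, φ.T @ y) / σ_n**2            # (M,)
    return μ_post, cho

def predict_closed_form(X_new, μ_post, cho, Ω1, Ω2, Ω3, σ_n=0.03):
    φ, _ = make_features(X_new, Ω1, Ω2, Ω3)
    mean  = φ @ μ_post                                    # E[f*]
    var_f = jnp.sum(φ * cho_solve(cho, φ.T).T, axis=1)    # latent Var[f*]
    return mean, var_f + σ_n**2                           # mean, predictive Var[y*]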


5. Approximation Error and the Marginal-Likelihood Compass

Two more results worth having in your back pocket — both engineering-flavored, both useful for sizing experiments.

5.1 Mercer truncation: how much of the kernel are you keeping?

The kernel admits a Mercer expansion k(x, x') = Σᵢ λᵢ ψᵢ(x) ψᵢ(x') with λ₁ ≥ λ₂ ≥ …. Truncating to the top M eigenfunctions gives an approximation error

‖k − k_M‖² = Σ_{i > M} λᵢ²

For Bochner-type kernels the spectrum’s tail is set by the smoothness of the field (Matérn-ν tails like (1 + |ω|²ℓ²)^{−(ν + d/2)}). A field with high ν (e.g. the mean SSH after removing eddies) needs far fewer features for a target tolerance than a rough field (e.g. instantaneous SSH including mesoscale).

Engineering consequence. Pick M from the kernel tail you’ve targeted, not from a rule of thumb. For SSH-mesoscale (ν ≈ 3/2, ℓ ≈ 100 km, d = 3) on a 1000 × 1000 × 30 grid, M ≈ 1024 captures ≥ 99% of tr(K). Doubling ν (smoother field) drops the same coverage to M ≈ 256.
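One way to do that sizing empirically is a Nyström-flavoured estimate, where eigenvalues of a Gram matrix on a subsample stand in for the Mercer spectrum. A sketch (numbers illustrative, not the figures quoted above):

import numpy as np

ell = 100.0                                     # km, mesoscale lengthscale
pts = np.random.default_rng(0).uniform(0, 2000, size=(2000, 2))  # 2-D spatial slice

# Matern-3/2 Gram matrix on the subsample
r = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
K = (1 + np.sqrt(3) * r / ell) * np.exp(-np.sqrt(3) * r / ell)

lam = np.linalg.eigvalsh(K)[::-1]               # descending spectrum
coverage = np.cumsum(lam) / lam.sum()
print("M for 99% of tr(K):", int(np.searchsorted(coverage, 0.99)) + 1)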

5.2 Marginal likelihood gradient as a spectral compass

For SSGP, differentiating the log-marginal-likelihood w.r.t. a single spectral point ωᵢ:

∂/∂ωᵢ log p(y | Ω) = (yᵀ A⁻¹ ∂Φ/∂ωᵢ wᵢ_eff) − tr(A⁻¹ ∂Φ/∂ωᵢ Φᵀ)

where A = σ_f² Φ Φᵀ + σ_n² I and wᵢ_eff is the contribution of feature i to the posterior. The first term rewards moving ωᵢ toward frequencies where the residual y − Φ μ_post has spectral mass; the second penalizes redundancy with already-placed features.

Engineering consequence. You don’t need to know the right spectral prior in advance — running a few epochs of SSGP/VSSGP starting from a generic prior reveals it. Inspect the histogram of optimized Ω after training and you have an empirical estimate of the true spectral density. This is impossible to extract from a trained SIREN.
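With the log-evidence function from the §0.5 sketch, the compass is literally jax.grad. A fragment, assuming Omega, b, X, y from the earlier sketches:

import jax
import jax.numpy as jnp

def neg_log_evidence(Omega, b, X, y, sigma_f, sigma_n):
    Phi = jnp.sqrt(2.0 / Omega.shape[0]) * jnp.cos(X @ Omega.T + b)
    return -log_evidence(Phi, y, sigma_f, sigma_n)        # from the §0.5 sketch

# per-frequency direction of steepest evidence ascent: the two-term tradeoff above
compass = jax.grad(neg_log_evidence)(Omega, b, X, y, 1.0, 0.05)   # (M, d)

# after training, a histogram of |Omega| rows is an empirical spectral density:
# counts, edges = jnp.histogram(jnp.linalg.norm(Omega, axis=1), bins=50)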

5.3 When to not trust this argument

Three failure modes of the “VSSGP > SIREN for geoscience” thesis worth flagging honestly:

| Failure mode | What goes wrong | Fix |
|---|---|---|
| Strongly non-stationary fields (e.g. coastal SSH, frontal SST) | a single global p(ω) is the wrong object — local lengthscale varies | deep GP / mixture-of-experts SSGP / patchwise approach (see 03_global_scaling_patches) |
| Highly nonlinear forward models (e.g. radiative transfer, log-Chl-a beyond the linear regime) | RFE/SSGP is linear-in-features; nonlinearity needs depth | deep RFE (Cutajar 2017) — compose RFF layers as a deep GP |
| Massive-M regimes (M > 10⁴) | O(M³) posterior covariance dominates | switch to inducing points / SVGP, or matrix-free CG (see 01_efficient_machinery) |

In the first two cases SIREN’s flexibility — fully learned features, deep nonlinearity — is genuinely useful. The argument here is not “VSSGP wins everywhere” but “for the dominant geoscience regime (sparse, irregular, banded spectrum, mostly stationary within a region) the structural advantages compound”.


6. Experiment Planning — SSH, SST, SSS, OC

This section turns the theoretical argument into a concrete benchmark. Four ocean variables, four regional zoom levels, four classes of metrics, four-rung method ladder. Every combination should be a directly runnable experiment in gaussx/pyrox.

6.1 What we’re trying to demonstrate

| Claim | Test |
|---|---|
| C1. Informative p(ω) outperforms uninformative p(ω) at fixed M | RFF (uniform) vs SSGP-with-MIOST-init vs SIREN, all at M = 1024 |
| C2. Posterior > point estimate for sparse data | SSGP point predictions vs VSSGP credible intervals — measure the CRPS gap |
| C3. Operational priors transfer | init VSSGP from MIOST/DUACS in Region A, evaluate on Region B without re-fitting hyperparameters |
| C4. The hierarchy ranks by sample efficiency | train all four methods on N/k for k ∈ {1, 2, 4, 8}; plot RMSE vs N — the gap should widen as N shrinks |
| C5. Spectral structure is preserved, not just RMSE | compute power-spectral coherence — VSSGP should match the truth across the full mesoscale band; SIREN drops off |
| C6. Derived physical quantities are improved | geostrophic currents from ∇η̂ should match drifters better when the prior is informative |

6.2 Datasets and regions

Four nested zoom levels — same hierarchy of regions across all four variables, so comparisons stay on-axis:

| Zoom | Region | Lon × Lat | Why |
|---|---|---|---|
| Z1 — small | Mediterranean | [−6, 36] × [30, 46] | eddy-rich, well-instrumented, manageable N |
| Z2 — medium | Gulf Stream extension | [−80, −40] × [25, 50] | strong anisotropy → tests DYMOST-style flow-aligned prior |
| Z3 — large | North Atlantic | [−80, 0] × [10, 60] | cross-regime: subtropical gyre + Gulf Stream + sub-polar |
| Z4 — global | Global ocean | [−180, 180] × [−80, 80] | tests patch-decomposition stitching from 03_global_scaling_patches |

Per-variable data products (CMEMS / NASA / ESA, all freely accessible):

| Variable | Sparse obs (training) | Gridded reference (eval) | In-situ (independent eval) |
|---|---|---|---|
| SSH | SEALEVEL_GLO_PHY_L3_MY_008_062 (nadir altimetry, ~3×10⁵ obs/day) + _069 (SWOT KaRIn, ~1.7×10⁶ obs/day) | DUACS L4 SEALEVEL_GLO_PHY_L4_MY_008_047 | tide gauges INSITU_GLO_PHY_SSH_DISCRETE_NRT_013_059, GDP drifters (for u_g) |
| SST | GHRSST L2P MODIS-Aqua/Terra, AVHRR-19/MetOp | OSTIA L4 SST_GLO_SST_L4_NRT_OBSERVATIONS_010_001 | Argo floats, drifting buoys (NDBC) |
| SSS | SMOS L2 v700, SMAP RSS V5 (~6×10⁴ obs/day each) | ESA SMOS L4 BEC, Multi-Mission Optimal Interpolated SSS | Argo (calibrated), TSG underway |
| OC / Chl-a | OC-CCI L3 v6.0 daily (sun-synchronous + cloud gaps) | OC-CCI L4 monthly composite | BGC-Argo Chl fluorometers, in-situ HPLC (NASA SeaBASS) |

6.3 Experimental protocols

Three protocols, each isolating a different failure mode:

6.3.1 OSSE (controlled — ground truth available)

Use a high-resolution model run as truth; sample synthetic observations at real altimeter / SMAP / drifter geometries.

This is the gold standard for the spectral-coherence and effective-resolution metrics — they require dense ground truth.

6.3.2 Leave-one-track-out (real data — independent altimeter)

For SSH only, the SSH community standard (Ballarotta et al. 2019). Train on {SARAL, Sentinel-3A/B, Cryosat-2, Jason-3} minus one; predict the held-out altimeter’s tracks; evaluate at observation points.

6.3.3 In-situ withholding (real data — independent platform)

Train on satellite obs only; evaluate on tide gauges (SSH) / Argo (SST, SSS) / BGC-Argo (Chl-a). Independent measurement system, so satellite biases don’t leak.

| Protocol | Pros | Cons | Use for |
|---|---|---|---|
| OSSE | dense truth, controlled noise | model-truth mismatch | spectral metrics, calibration |
| Leave-one-track | real data, real noise | same instrument family | effective resolution |
| In-situ | truly independent | sparse, biased toward coasts | bias detection, regional skill |

6.4 Metrics — four classes

The interpolation community routinely under-reports metrics; a single RMSE hides the differences between methods that matter most. Four families, each catching a different failure mode:

6.4.1 Point accuracy

Standard but necessary:

RMSE  = sqrt( mean( (ŷ − y_true)² ) )
MAE   = mean( |ŷ − y_true| )
bias  = mean( ŷ − y_true )
nRMSE = RMSE / std(y_true)               # cross-region comparable
R²    = 1 − var(ŷ − y_true) / var(y_true)

Per-variable target (OSSE on Z2 Gulf Stream extension, 1 month, full SWOT-era):

| Variable | DUACS-class baseline RMSE | VSSGP target | Threshold for “informative prior helps” |
|---|---|---|---|
| SSH | 4.5 cm | < 3.5 cm | nRMSE drop ≥ 15% |
| SST | 0.45 K | < 0.30 K | nRMSE drop ≥ 25% |
| SSS | 0.25 PSU | < 0.20 PSU | nRMSE drop ≥ 15% |
| log Chl-a | 0.30 log₁₀ | < 0.22 log₁₀ | nRMSE drop ≥ 20% |

6.4.2 Probabilistic / calibration

Point methods (RFF-ridge, SIREN) cannot compute these — that’s the point.

NLPD  = − mean( log p(y_true | x*) )
      # for Gaussian: ½ log(2π σ²) + ½ (ŷ − y)² / σ²

CRPS  = mean( ∫ ( F(z; x*) − 𝟙{z ≥ y_true} )² dz )
      # for Gaussian-posterior: closed-form CRPS

PIT   = F(y_true ; x*) ∈ [0, 1]
      # histogram should be Uniform[0, 1] under correct calibration

Reliability:
   bin predictions by quantile q; check empirical coverage = q
   ECE = mean( |q − coverage(q)| )

CRPS is the headline probabilistic metric — proper, scale-aware, low-dimensional. NLPD is sensitive to outlier underprediction (heavy tails) so it disambiguates Gaussian vs. Student-T likelihoods.
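The Gaussian-posterior CRPS mentioned above has a standard closed form. A sketch:

import jax.numpy as jnp
from jax.scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    # closed-form CRPS for a Gaussian predictive N(mu, sigma^2); lower is better
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z)
                    - 1 / jnp.sqrt(jnp.pi))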

6.4.3 Spectral metrics

This is where SIREN is expected to do badly even at competitive RMSE — it gets the spatial autocorrelation wrong.

Power spectral density (PSD):
   PSD_pred(k)  = |FFT(ŷ − ⟨ŷ⟩)|²
   PSD_true(k)  = |FFT(y_true − ⟨y_true⟩)|²
   PSD_error(k) = |FFT(ŷ − y_true)|²

Spectral coherence (Ballarotta 2019, the SSH community standard):
   T(k) = 1 − PSD_error(k) / PSD_true(k)

Effective resolution L_eff:
   smallest scale where T(k) ≥ 0.5
   i.e. signal dominates noise

Spectral RMSE:
   ‖ log PSD_pred − log PSD_true ‖₂   # in log space, by decade
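A minimal 1-D (along-track) implementation of the coherence and effective-resolution definitions; in practice PSDs are averaged over many tracks or segments before forming T(k). A sketch:

import jax.numpy as jnp

def spectral_coherence(y_pred, y_true, dx):
    # dx: sample spacing in km; returns wavenumber (cycles/km) and T(k)
    psd = lambda v: jnp.abs(jnp.fft.rfft(v - v.mean())) ** 2
    k = jnp.fft.rfftfreq(y_true.size, d=dx)
    T = 1.0 - psd(y_pred - y_true) / psd(y_true)
    return k[1:], T[1:]                       # drop the k = 0 bin

def effective_resolution(k, T, thresh=0.5):
    # smallest resolved scale: 1 / (largest k still above the coherence threshold)
    return 1.0 / k[T >= thresh].max()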

Per-variable target on Z2 (Gulf Stream, OSSE):

| Variable | Reference L_eff (DUACS-class) | VSSGP target |
|---|---|---|
| SSH | 100–150 km | < 80 km (resolves mesoscale) |
| SST | 60 km | < 30 km (resolves frontal scale) |
| SSS | 200 km | < 150 km |
| Chl-a (log) | 150 km | < 80 km |

6.4.4 Physical / derived-quantity metrics

For SSH especially, the derived quantities matter more than η itself — operational users want currents, divergence, and vorticity:

Geostrophic velocity:
   u_g = − (g/f) ∂η/∂y     v_g = (g/f) ∂η/∂x
   RMSE_u = sqrt( mean( (û − u_drifter)² ) )      # vs GDP drifters

Vorticity:
   ζ = ∂v/∂x − ∂u/∂y
   RMSE_ζ                                          # vs OSSE truth

SST gradient (front detection):
   ∇T magnitude — Sobel
   F1 score for fronts above threshold

Chl-a bloom timing:
   day-of-year of annual maximum (per pixel)
   RMSE in days (cf. Henson et al. 2018)

These metrics are the ones DUACS/MIOST publish — they’re how the community accepts a new SSH product. The VSSGP claim has to land here, not just on RMSE.
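The geostrophic step is a pair of finite differences on the gridded posterior mean. A sketch for a regular grid with spacing in metres (it degrades near the equator, where f → 0):

import jax.numpy as jnp

G = 9.81                                            # m s^-2
OMEGA_E = 7.2921e-5                                 # Earth rotation rate, rad s^-1

def geostrophic_uv(eta, lat, dx, dy):
    # eta: (H, W) SSH in m; lat: (H,) degrees; dx, dy: grid spacing in m
    f = 2 * OMEGA_E * jnp.sin(jnp.deg2rad(lat))[:, None]   # Coriolis, (H, 1)
    deta_dy, deta_dx = jnp.gradient(eta, dy, dx)
    return -(G / f) * deta_dy, (G / f) * deta_dx           # u_g, v_g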

6.5 Per-variable experimental protocols

6.5.1 SSH — anchor variable, fullest treatment
Method ladder × 4:    {RFF-ridge, SIREN, SSGP, VSSGP-MIOST-init}
Region ladder × 4:    {Med, Gulf Stream, N. Atl, Global}
Time horizon × 3:     {1 month, 3 months, 6 months}
SWOT ablation × 2:    {nadir-only, nadir + SWOT}

⇒ 4 × 4 × 3 × 2 = 96 runs (running Z4 × 6 mo only for VSSGP trims this to ~80)

6.5.2 SST — diurnal cycle and fronts
Method ladder: same 4
Region: Med + Gulf Stream + Global (skip N. Atl alone — covered)
Likelihood: Student-T with ν ∈ {3, 5, 7} ablation (cloud-contamination tail)
Special diagnostic: gradient magnitude F1 score for fronts
6.5.3 SSS — extreme non-stationarity
Region: Amazon plume (3-week experiment), tropical Atlantic, global
Special: σ_w(x) field MUST be spatially varying — open-ocean σ_w ≈ 0.5 PSU,
         plume σ_w ≈ 3-5 PSU; flat σ_w fails by construction
6.5.4 OC / Chl-a — log-space and bloom dynamics
Region: N. Atlantic spring bloom (March-May), Arabian Sea (SW monsoon),
        Southern Ocean (austral spring), global
Transform: log₁₀ — the entire model lives in log space
Special: bloom timing metric — most operational mappings smear the bloom edge

6.6 Ablation matrices

Five ablations isolate which design choice is doing the work. Each is one row of the result table; designed to fail differently.

| # | Ablation | What it isolates | Expected outcome |
|---|---|---|---|
| A1 | VSSGP-uniform-prior vs VSSGP-MIOST-prior | value of the operational prior | A1 narrows the nRMSE gap by ~50% (the rest of the gap is from the posterior, not the prior) |
| A2 | M ∈ {64, 128, 256, 512, 1024, 2048} | sample-complexity scaling | RMSE vs M should be a clean power law for VSSGP, ragged for SIREN |
| A3 | flat σ_w vs DUACS-derived σ_w(x) | value of spatially varying amplitude | big gap on SSS, small on SSH |
| A4 | isotropic vs DYMOST-anisotropic Σ_ω | value of the flow-aligned prior | Gulf Stream RMSE drop ≥ 10% on v_g |
| A5 | train on Med, eval on N. Atlantic | prior transfer | VSSGP-MIOST should hold; SSGP-no-prior should fail |

6.7 Compute budget

Order-of-magnitude per run on 1× A100 80GB (per the wall-clock numbers in 01_efficient_machinery):

| Region | N (full SWOT-era, 1 mo) | M | Train/run | Eval/run | Per-method total |
|---|---|---|---|---|---|
| Z1 Med | ~1.5 × 10⁵ | 1024 | 5 min | 1 min | 6 min |
| Z2 Gulf Stream | ~3 × 10⁵ | 1024 | 12 min | 2 min | 14 min |
| Z3 N. Atlantic | ~1.5 × 10⁶ | 2048 | 1.5 hr | 5 min | 1.6 hr |
| Z4 Global (patched) | ~6 × 10⁷ | 1024 per patch | 2 hr (8× A100, dask) | 10 min | 2.2 hr |

For the full method × region × variable × ablation grid: roughly 300–400 GPU-hours, dominated by Z3+Z4 + the M-sweep ablation. Tractable on a small cluster over a week.

6.8 Reporting template

Each variable + region combo produces one results row. Recommended format:

| Method        | RMSE  | CRPS  | NLPD  | L_eff | RMSE u_g | ECE |
|---------------|-------|-------|-------|-------|----------|-----|
| RFF-ridge     | 4.2cm |   —   |   —   | 140km |  9.3cm/s |  —  |
| SIREN         | 3.9cm |   —   |   —   | 165km |  9.8cm/s |  —  |
| SSGP          | 3.5cm | 1.8cm | -0.42 | 105km |  7.1cm/s | .12 |
| VSSGP-MIOST   | 3.1cm | 1.5cm | -0.61 |  82km |  6.2cm/s | .04 |
| ----          |       |       |       |       |          |     |
| DUACS L4      | 4.5cm |   —   |   —   | 150km |  8.5cm/s |  —  |
| MIOST         | 3.6cm |   —   |   —   | 110km |  7.4cm/s |  —  |

Bolding rule: per metric column, bold the best of {RFF, SIREN, SSGP, VSSGP}; italicize if it also beats all operational baselines in that column. The story of the paper is whichever cells consistently land in italic-bold.


Key References

Foundations

- Rahimi, A. & Recht, B. (2007). Random Features for Large-Scale Kernel Machines. NeurIPS.
- Rahimi, A. & Recht, B. (2008). Uniform Approximation of Functions with Random Bases. Allerton Conference.
- Wilson, A. G. & Adams, R. P. (2013). Gaussian Process Kernels for Pattern Discovery and Extrapolation. ICML.

The hierarchy

- Lázaro-Gredilla, M., Quiñonero-Candela, J., Rasmussen, C. E. & Figueiras-Vidal, A. R. (2010). Sparse Spectrum Gaussian Process Regression. JMLR 11.
- Gal, Y. & Turner, R. (2015). Improving the Gaussian Process Sparse Spectrum Approximation by Representing Uncertainty in Frequency Inputs. ICML.
- Cutajar, K., Bonilla, E. V., Michiardi, P. & Filippone, M. (2017). Random Feature Expansions for Deep Gaussian Processes. ICML.

Neural-field side

- Sitzmann, V., Martel, J. N. P., Bergman, A. W., Lindell, D. B. & Wetzstein, G. (2020). Implicit Neural Representations with Periodic Activation Functions. NeurIPS.

Geoscience priors

- Ubelmann, C., Klein, P. & Fu, L.-L. (2015). Dynamic Interpolation of Sea Surface Height and Potential Applications for Future Altimetry Mapping. J. Atmos. Oceanic Technol. 32.
- Pujol, M.-I. et al. (2016). DUACS DT2014: the new multi-mission altimeter data set reprocessed over 20 years. Ocean Science 12.
- Taburet, G. et al. (2019). DUACS DT2018: 25 years of reprocessed sea level altimetry products. Ocean Science 15.
- Ballarotta, M. et al. (2019). On the resolutions of ocean altimetry maps. Ocean Science 15.