SSH Reconstruction via GP Pathwise Sampling

The Inverse Problem

Goal: Reconstruct the SSH anomaly field $\eta$ on a regular grid at a fixed target time $t$, using satellite altimeter tracks from a temporal window $[t - \tau, t + \tau]$. This is a pure static inverse problem — no dynamics, no time-stepping. We treat the SSH field as approximately stationary over the window and let the kernel handle temporal discounting automatically.

Why pool $t \pm \tau$? A single Jason-3 pass covers roughly 1–5% of the Mediterranean at any given time, with a 10-day repeat cycle. A single Sentinel-6 pass is similar. Pooling passes from nearby times dramatically increases spatial coverage. The cost is a temporal mismatch — observations at $t \pm \tau$ represent the field at slightly different states. Rather than manually inflating error variances (as 3D-Var does), the GP handles this gracefully: a spatiotemporal kernel naturally assigns lower covariance to observation pairs that are far apart in time, so temporally distant tracks are automatically downweighted during inference.


Domain and State

| Symbol | Shape | Meaning |
| --- | --- | --- |
| $n_{lat}, n_{lon}$ | scalars | Number of grid cells in latitude and longitude |
| $n = n_{lat} \times n_{lon}$ | scalar | Total grid cells — e.g. a $0.1°$ grid over the Med gives $\approx 10^5$ |
| $\eta \in \mathbb{R}^n$ | $(n,)$ | SSH anomaly field at time $t$, vectorised row-major over the grid |

The field η is what we want to recover. It is unobserved everywhere except along sparse satellite swaths.


Observations

Satellite altimeters sample SSH along 1D ground tracks — narrow strips that cross the Mediterranean at oblique angles. We pool three groups of passes:

| Symbol | Shape | Meaning |
| --- | --- | --- |
| $y_- \in \mathbb{R}^{m_-}$ | $(m_-,)$ | Along-track SSH anomaly measurements at time $t - \tau$ |
| $y_0 \in \mathbb{R}^{m_0}$ | $(m_0,)$ | Along-track SSH anomaly measurements at time $t$ |
| $y_+ \in \mathbb{R}^{m_+}$ | $(m_+,)$ | Along-track SSH anomaly measurements at time $t + \tau$ |
| $y = [y_-;\, y_0;\, y_+]$ | $(m,)$ | Full concatenated observation vector, $m = m_- + m_0 + m_+$ |

Typical values for the Mediterranean: $m_\pm \sim 500$–$2000$ per pass group, $m \sim 3000$–$8000$ total. Each observation $y_i$ has an associated position $(lon_i, lat_i)$ and time $t_i \in \{t-\tau, t, t+\tau\}$.


The GP Prior

A Gaussian process is a distribution over functions. Writing $f: \mathcal{X} \to \mathbb{R}$ where $\mathcal{X}$ is the joint (space, time) domain:

$$f \sim \mathrm{GP}(0, k_\theta)$$

This means: for any finite collection of inputs $\{(x_i, t_i)\}_{i=1}^m$, the vector of function values $[f(x_1, t_1), \ldots, f(x_m, t_m)]$ is jointly Gaussian with zero mean and covariance matrix whose $(i,j)$ entry is $k_\theta((x_i, t_i), (x_j, t_j))$.

The kernel $k_\theta$ is the sole prior specification — it encodes everything we believe about the SSH field before seeing data: how smooth it is, how quickly correlations decay in space and time, and what its overall amplitude is.

Zero mean is appropriate here because we work with SSH anomalies (mean SSH removed), so the prior mean is zero by construction.


Training and Prediction Inputs

The GP operates directly on (lon, lat, time) coordinates — no grid required at training time. Define:

$$\mathcal{X} = \begin{bmatrix} lon_1^- & lat_1^- & t-\tau \\ \vdots & \vdots & \vdots \\ lon_{m_-}^- & lat_{m_-}^- & t-\tau \\ lon_1^0 & lat_1^0 & t \\ \vdots & \vdots & \vdots \\ lon_{m_+}^+ & lat_{m_+}^+ & t+\tau \end{bmatrix} \in \mathbb{R}^{m \times 3}$$

Each row is one satellite observation's (lon, lat, time) coordinate. The prediction inputs are all grid cells at time $t$:

$$\mathcal{X}^* = \begin{bmatrix} lon_1^* & lat_1^* & t \\ \vdots & \vdots & \vdots \\ lon_n^* & lat_n^* & t \end{bmatrix} \in \mathbb{R}^{n \times 3}$$

All prediction points share the same time $t$ — we want the reconstructed field at the target time only.
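
As a concrete illustration, here is a minimal numpy sketch of how $\mathcal{X}$ and $\mathcal{X}^*$ might be assembled; the array names (`lon_m`, `grid_lon`, etc.) are hypothetical placeholders for the along-track coordinates of the three pass groups and the target grid axes.

```python
import numpy as np

def build_inputs(lon_m, lat_m, lon_0, lat_0, lon_p, lat_p, t, tau, grid_lon, grid_lat):
    """Stack (lon, lat, time) rows for training inputs X and prediction inputs X_star."""
    X = np.vstack([
        np.column_stack([lon_m, lat_m, np.full(len(lon_m), t - tau)]),  # pass group at t - tau
        np.column_stack([lon_0, lat_0, np.full(len(lon_0), t)]),        # pass group at t
        np.column_stack([lon_p, lat_p, np.full(len(lon_p), t + tau)]),  # pass group at t + tau
    ])                                                                  # (m, 3)
    glon, glat = np.meshgrid(grid_lon, grid_lat)                        # (n_lat, n_lon) grids
    X_star = np.column_stack([glon.ravel(), glat.ravel(),
                              np.full(glon.size, t)])                   # (n, 3), row-major
    return X, X_star
```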


Kernel — Encoding Prior Physics

Use a separable spatiotemporal kernel — separable means the spatial and temporal contributions multiply:

$$k_\theta\!\left((x, t),\, (x', t')\right) = \sigma_\eta^2 \cdot k_s(x, x';\, L_s) \cdot k_t(t, t';\, L_t)$$

Spatial Kernel — Matérn-3/2

The Matérn-3/2 kernel on the sphere (using great-circle distance $d$):

$$k_s(x, x';\, L_s) = \left(1 + \frac{\sqrt{3}\,d(x,x')}{L_s}\right)\exp\!\left(-\frac{\sqrt{3}\,d(x,x')}{L_s}\right)$$

Why Matérn-3/2 and not Gaussian? The squared-exponential (Gaussian) kernel produces infinitely differentiable sample paths — unrealistically smooth for SSH. Matérn-3/2 produces once mean-square differentiable paths, which is physically appropriate for mesoscale SSH fields that have sharp fronts and eddy boundaries. Matérn-5/2 (twice differentiable) is also a reasonable choice.

Why great-circle distance? The Mediterranean spans enough longitude (~40°) that planar Euclidean distance is inaccurate. Great-circle distance respects the spherical geometry.

Temporal Kernel — Ornstein-Uhlenbeck

$$k_t(t, t';\, L_t) = \exp\!\left(-\frac{|t - t'|}{L_t}\right)$$

The OU kernel (Matérn-1/2) produces exponentially decaying temporal correlations. The key physical consequence: observations at $t \pm \tau$ contribute to the reconstruction at time $t$ with a weight of $\exp(-\tau / L_t)$ — automatically. If $L_t = 10$ days and $\tau = 3$ days, this factor is $\exp(-0.3) \approx 0.74$, meaning observations 3 days away carry about 74% of the weight of contemporaneous observations. No manual tuning of temporal weights needed.
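
A minimal numpy sketch of this separable kernel, assuming lon/lat in degrees and time in days; the haversine helper, function names, and default hyperparameters (the typical Mediterranean values listed under Hyperparameters below) are illustrative rather than a reference implementation.

```python
import numpy as np

R_EARTH_KM = 6371.0  # mean Earth radius, assumed constant

def great_circle_km(lon1, lat1, lon2, lat2):
    """Haversine great-circle distance in km; inputs in degrees, broadcastable arrays."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2.0 * R_EARTH_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def kernel(X1, X2, sigma2=100.0, Ls=100.0, Lt=10.0):
    """Separable kernel: sigma2 * Matern-3/2(great-circle km) * OU(|dt| in days)."""
    d = great_circle_km(X1[:, None, 0], X1[:, None, 1], X2[None, :, 0], X2[None, :, 1])
    s = np.sqrt(3.0) * d / Ls
    k_space = (1.0 + s) * np.exp(-s)
    k_time = np.exp(-np.abs(X1[:, None, 2] - X2[None, :, 2]) / Lt)
    return sigma2 * k_space * k_time
```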

Hyperparameters

| Parameter | Meaning | Typical value (Med) |
| --- | --- | --- |
| $\sigma_\eta^2$ | SSH anomaly marginal variance | $\sim 100\,\text{cm}^2$ |
| $L_s$ | Spatial decorrelation length | $\sim 100\,\text{km}$ |
| $L_t$ | Temporal decorrelation timescale | $\sim 10\,\text{days}$ |
| $\sigma_{obs}^2$ | Altimeter noise variance | $\sim 4\,\text{cm}^2$ |

Key Matrices

These are the four objects that define all subsequent computation.

Gram matrix at training points:

$$K_{\mathcal{X}\mathcal{X}} \in \mathbb{R}^{m \times m}, \qquad [K_{\mathcal{X}\mathcal{X}}]_{ij} = k_\theta(\mathcal{X}_i, \mathcal{X}_j)$$

This is a dense symmetric PSD matrix. Entry $(i,j)$ is the prior covariance between observations $i$ and $j$ — it encodes how similar the SSH field is expected to be at those two (location, time) pairs. Observations close in space and time will have high covariance; observations far apart will have near-zero covariance. The diagonal is $[K_{\mathcal{X}\mathcal{X}}]_{ii} = \sigma_\eta^2$ (the full prior variance at each point).

Noisy Gram matrix:

$$C = K_{\mathcal{X}\mathcal{X}} + \sigma_{obs}^2 I_m \in \mathbb{R}^{m \times m}$$

Adding $\sigma_{obs}^2 I_m$ accounts for altimeter instrument noise — it regularises the Gram matrix (ensuring it is strictly PD) and prevents overfitting to noisy observations. $C$ is the object we invert — it represents the total observation-space covariance including noise.

Cross-covariance matrix:

$$K_{\mathcal{X}^*\mathcal{X}} \in \mathbb{R}^{n \times m}, \qquad [K_{\mathcal{X}^*\mathcal{X}}]_{ij} = k_\theta(\mathcal{X}^*_i, \mathcal{X}_j)$$

Row $i$ of this matrix is the prior covariance between grid cell $i$ (at time $t$) and each observation $j$ (at its (location, time)). This is the gain numerator — it determines how strongly each observation influences each grid cell. A grid cell far from all tracks will have small entries across its entire row, meaning observations barely update it. A grid cell directly under a track will have large entries for nearby observations.

Prior variance at prediction points (diagonal only):

$$[K_{\mathcal{X}^*\mathcal{X}^*}]_{ii} = k_\theta(\mathcal{X}^*_i, \mathcal{X}^*_i) = \sigma_\eta^2 \quad \forall i$$

Since all prediction points share time $t$ and the spatial kernel equals 1 at zero distance, the prior variance is $\sigma_\eta^2$ uniformly. The full matrix $K_{\mathcal{X}^*\mathcal{X}^*} \in \mathbb{R}^{n \times n}$ is never formed.


Posterior Derivation

The GP prior and the observation model:

$$y = f(\mathcal{X}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_{obs}^2 I_m)$$

together define a joint Gaussian over $(f(\mathcal{X}^*), y)$:

$$\begin{bmatrix} f(\mathcal{X}^*) \\ y \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \begin{bmatrix} K_{\mathcal{X}^*\mathcal{X}^*} & K_{\mathcal{X}^*\mathcal{X}} \\ K_{\mathcal{X}\mathcal{X}^*} & C \end{bmatrix} \right)$$

Applying the Gaussian conditioning formula from the unifying framework with:

$$\Sigma_{xx} = K_{\mathcal{X}^*\mathcal{X}^*}, \quad \Sigma_{xz} = K_{\mathcal{X}^*\mathcal{X}}, \quad \Sigma_{zz} = C$$

the posterior is:

$$f(\mathcal{X}^*) \mid y \;\sim\; \mathcal{N}\!\left(\mu_{f|y},\; \Sigma_{f|y}\right)$$

$$\mu_{f|y} = K_{\mathcal{X}^*\mathcal{X}}\, C^{-1}\, y \qquad (n,)$$

$$\Sigma_{f|y} = K_{\mathcal{X}^*\mathcal{X}^*} - K_{\mathcal{X}^*\mathcal{X}}\, C^{-1}\, K_{\mathcal{X}\mathcal{X}^*} \qquad (n, n)$$

The posterior mean $\mu_{f|y}$ is the minimum mean-square error estimate of the SSH field — the best single map you can produce. The posterior covariance $\Sigma_{f|y}$ describes the joint uncertainty over all grid cells — but at $(n \times n)$ it is completely intractable to store or compute for $n = 10^5$.

Practical posterior quantities: We never compute $\Sigma_{f|y}$ directly. Instead, we draw exact posterior samples via the pathwise update below and estimate whatever is needed (pointwise variances, credible intervals, derived quantities) from the sample ensemble.


The Pathwise Update — Correcting Prior Samples

The standard posterior mean formula above requires $C^{-1}y$ — a single linear solve. That gives you a point estimate. To get samples from the posterior, the naive approach would be to Cholesky-factorise the $(n \times n)$ posterior covariance $\Sigma_{f|y}$ — completely intractable.

Matheron’s rule (Wilson et al., 2020) provides an exact alternative. The key insight is that for Gaussian random variables, the conditional distribution can be written as a linear correction to a joint prior sample:

$$f(\mathcal{X}^*) \mid y \;\overset{d}{=}\; \tilde{f}(\mathcal{X}^*) + K_{\mathcal{X}^*\mathcal{X}}\, C^{-1}\,\bigl(y - \tilde{f}(\mathcal{X})\bigr)$$

where $(\tilde{f}(\mathcal{X}^*), \tilde{f}(\mathcal{X}))$ is a joint prior sample drawn consistently from $\mathrm{GP}(0, k_\theta)$ at both the prediction and training inputs. This equality is in distribution — the right-hand side is an exact draw from the posterior $p(f \mid y)$.

Why does this work? Because the correction term is the posterior mean applied not to $y$ but to the innovation $y - \tilde{f}(\mathcal{X})$. Since $(\tilde{f}(\mathcal{X}^*), \tilde{f}(\mathcal{X}))$ are jointly Gaussian, the correction is linear, and the result has exactly the right mean and covariance.

Why is this better than sampling from $\Sigma_{f|y}$ directly? Because all of the heavy linear algebra stays in observation space: the only factorisation is the $(m \times m)$ Cholesky of $C$, which is reused for every sample, the $(n \times n)$ posterior covariance is never formed, and each sample costs $\mathcal{O}(nm + rn)$ rather than the $\mathcal{O}(n^3)$ a direct factorisation of $\Sigma_{f|y}$ would require.


Prior Sample via Random Fourier Features

To apply Matheron's rule we need a joint prior sample $(\tilde{f}(\mathcal{X}^*), \tilde{f}(\mathcal{X}))$ drawn consistently from $\mathrm{GP}(0, k_\theta)$.

Bochner's theorem guarantees that a stationary kernel $k_\theta(x, x') = k_\theta(x - x')$ can be written as the Fourier transform of a spectral density $p(\omega)$:

$$k_\theta(x - x') = \int p(\omega)\, e^{i\omega^\top(x - x')}\, d\omega$$

This motivates the Random Fourier Features (RFF) approximation. Sample $r$ frequency vectors $\{\omega_j\}_{j=1}^r$ from $p(\omega)$ and draw uniform phase offsets $\{b_j\}_{j=1}^r \sim \mathrm{Uniform}[0, 2\pi]$. Define the feature map:

$$\phi_j(x) = \sqrt{\frac{2\sigma_\eta^2}{r}}\,\cos(\omega_j^\top x + b_j)$$

Then the approximate prior sample is:

$$\tilde{f}(x, t) = \sum_{j=1}^{r} w_j\, \phi_j^s(x)\, \phi_j^t(t), \qquad w_j \overset{\text{iid}}{\sim} \mathcal{N}(0, 1)$$

where $\phi_j^s$ and $\phi_j^t$ are RFF features for the spatial and temporal kernels respectively (drawn once from their spectral densities), with the normalisation arranged so that the summed products reproduce $\sigma_\eta^2\, k_s\, k_t$ in expectation. The weights $\{w_j\}$ are fixed for a given sample draw — evaluating $\tilde{f}$ at any point is then a deterministic computation.

Consistency: because $\{w_j\}$ are fixed, evaluating $\tilde{f}$ at $\mathcal{X}$ and $\mathcal{X}^*$ uses the same weights — this is what makes the prior sample joint and consistent, which is required for Matheron's rule to produce exact posterior draws.

How many features $r$? The RFF approximation introduces an error of $\mathcal{O}(1/\sqrt{r})$ in the kernel approximation. For SSH reconstruction, $r \sim 1000$–$4000$ is typically sufficient. The approximation error in the posterior sample decreases as $r$ increases; the compute cost grows as $\mathcal{O}(r(m+n))$ per sample.
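
A sketch of such a prior sampler, under assumptions not spelled out in the text: lon/lat are mapped to approximate local-plane coordinates in km (Bochner's theorem applies to stationary kernels on $\mathbb{R}^d$, so the great-circle kernel is treated as locally Euclidean over the Mediterranean), Matérn-3/2 frequencies are drawn from a Student-t with 3 degrees of freedom, and OU frequencies from a Cauchy distribution. The function `make_rff_prior` and its defaults are hypothetical.

```python
import numpy as np

def make_rff_prior(r=2000, sigma2=100.0, Ls=100.0, Lt=10.0, lat0=38.0, rng=None):
    """Return a callable f_tilde(X) giving one consistent GP(0, k_theta) prior draw."""
    rng = np.random.default_rng() if rng is None else rng
    # Spatial frequencies: multivariate Student-t_3 scaled by 1/Ls (Matern-3/2 spectral density)
    z = rng.standard_normal((r, 2))
    g = rng.chisquare(3, size=(r, 1))
    omega_s = z * np.sqrt(3.0 / g) / Ls
    # Temporal frequencies: Cauchy scaled by 1/Lt (OU / Matern-1/2 spectral density)
    omega_t = rng.standard_cauchy(r) / Lt
    b_s = rng.uniform(0.0, 2.0 * np.pi, r)   # spatial phase offsets
    b_t = rng.uniform(0.0, 2.0 * np.pi, r)   # temporal phase offsets
    w = rng.standard_normal(r)               # sample-specific weights (Step 6)

    def f_tilde(X):
        # Local tangent-plane approximation: degrees -> km around reference latitude lat0
        km = np.column_stack([X[:, 0] * 111.2 * np.cos(np.radians(lat0)),
                              X[:, 1] * 111.2])
        phi_s = np.cos(km @ omega_s.T + b_s)               # (N, r) spatial cosines
        phi_t = np.cos(np.outer(X[:, 2], omega_t) + b_t)   # (N, r) temporal cosines
        # Normalisation: Cov[f_tilde] -> sigma2 * k_s * k_t as r grows
        return (2.0 * np.sqrt(sigma2 / r)) * (phi_s * phi_t) @ w

    return f_tilde
```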


Full Algorithm

Offline (once per time window)

Step 1 — Build the Gram matrix:

$$[K_{\mathcal{X}\mathcal{X}}]_{ij} = \sigma_\eta^2 \cdot k_s\!\left(x_i, x_j;\, L_s\right) \cdot k_t\!\left(t_i, t_j;\, L_t\right) \qquad (m, m)$$

Evaluate $m^2$ kernel calls. Each call computes a great-circle distance and evaluates Matérn-3/2 × OU — $\mathcal{O}(1)$ per pair.

Step 2 — Form and Cholesky-factorise CC:

$$C = K_{\mathcal{X}\mathcal{X}} + \sigma_{obs}^2 I_m \qquad (m, m)$$

$$C = LL^\top, \qquad L \in \mathbb{R}^{m \times m} \text{ lower triangular}$$

This is the dominant computation. $L$ is stored and reused for all subsequent solves and for every posterior sample.

Step 3 — Compute the posterior mean coefficient vector:

$$\alpha = C^{-1}y = L^{-\top}L^{-1}y \qquad (m,)$$

Two triangular solves. $\alpha$ is the vector of dual weights — entry $\alpha_i$ is the weight assigned to observation $i$ in the posterior mean. Observations with high $\alpha_i$ are highly informative; those in dense clusters will have smaller $\alpha_i$ due to redundancy.

Step 4 — Build the cross-covariance matrix:

$$[K_{\mathcal{X}^*\mathcal{X}}]_{ij} = \sigma_\eta^2 \cdot k_s\!\left(x_i^*, x_j;\, L_s\right) \cdot k_t\!\left(t, t_j;\, L_t\right) \qquad (n, m)$$

Evaluate $nm$ kernel calls. Since all prediction points share time $t$, the temporal factor $k_t(t, t_j; L_t)$ takes only three distinct values (one per time group $t-\tau, t, t+\tau$) — this allows factoring:

$$K_{\mathcal{X}^*\mathcal{X}} = \begin{bmatrix} e^{-\tau/L_t}\, K^*_s(-, L_s) & K^*_s(0, L_s) & e^{-\tau/L_t}\, K^*_s(+, L_s) \end{bmatrix}$$

where $K^*_s(\star, L_s) \in \mathbb{R}^{n \times m_\star}$ is the purely spatial cross-covariance matrix between all grid points and the $\star$-group track locations. This factorisation reduces the number of unique kernel evaluations to $n(m_- + m_0 + m_+)$ spatial evaluations plus $m$ cheap exponential evaluations.

Step 5 — Compute the posterior mean field:

$$\mu_{f|y} = K_{\mathcal{X}^*\mathcal{X}}\, \alpha \qquad (n,)$$

A dense mat-vec: $(n \times m)$ times $(m \times 1)$. This gives the interpolated SSH field on the grid at time $t$.
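
A sketch of Steps 1–5 with numpy and scipy, reusing the hypothetical `kernel`, `X`, `X_star`, and `y` from the earlier snippets; `sigma_obs2` is an assumed value for $\sigma_{obs}^2$.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

sigma_obs2 = 4.0                                   # cm^2, altimeter noise variance (assumed)

K_XX = kernel(X, X)                                # Step 1: (m, m) Gram matrix
C = K_XX + sigma_obs2 * np.eye(len(X))             # Step 2: add instrument noise ...
L = cholesky(C, lower=True)                        # ... and factorise C = L L^T
alpha = solve_triangular(L.T,                      # Step 3: alpha = C^{-1} y
                         solve_triangular(L, y, lower=True), lower=False)
K_sX = kernel(X_star, X)                           # Step 4: (n, m) cross-covariance
posterior_mean = K_sX @ alpha                      # Step 5: SSH field on the grid at time t
```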

Per Posterior Sample (repeat $S$ times)

Step 6 — Draw RFF weights:

$$w_j \overset{\text{iid}}{\sim} \mathcal{N}(0, 1), \quad j = 1, \ldots, r$$

Sample spatial frequencies $\omega_j^s$ from the Matérn-3/2 spectral density, temporal frequencies $\omega_j^t$ from the OU spectral density (both done once, reused across samples if desired, or redrawn for fresh samples).

Step 7 — Evaluate prior sample at training and prediction inputs:

$$\tilde{f}(\mathcal{X}) \in \mathbb{R}^m, \qquad \tilde{f}(\mathcal{X}^*) \in \mathbb{R}^n$$

Each evaluation: form the $(m+n) \times r$ feature matrix $\Phi$, then multiply by $w$. Cost $\mathcal{O}(r(m+n))$.

Step 8 — Compute the innovation:

$$\delta = y - \tilde{f}(\mathcal{X}) \qquad (m,)$$

This is the residual between what the prior sample predicts at the track locations and what was actually observed. If the prior sample happened to pass through the data exactly, $\delta = 0$ and no correction is needed — the prior sample is already a valid posterior draw (this never happens in practice).

Step 9 — Solve the correction coefficient vector:

$$\beta = C^{-1}\delta = L^{-\top}L^{-1}\delta \qquad (m,)$$

Two triangular solves using the already-computed $L$. This is $\mathcal{O}(m^2)$ — cheap because $L$ is already available.

Step 10 — Apply the pathwise correction:

$$\boxed{\eta^*_s = \tilde{f}(\mathcal{X}^*) + K_{\mathcal{X}^*\mathcal{X}}\,\beta \qquad (n,)}$$

One dense mat-vec: $(n \times m)$ times $(m \times 1)$. The result $\eta^*_s$ is one exact posterior sample of the SSH field on the grid at time $t$.
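
Steps 6–10 as a sampling loop, again a sketch that leans on the hypothetical `make_rff_prior`, `kernel`, `K_sX`, `L`, and `y` from the snippets above; the value of `S` is chosen here only for illustration.

```python
S = 100                                           # number of posterior samples (assumed)
samples = np.empty((S, len(X_star)))

for s in range(S):
    f_tilde = make_rff_prior(r=2000)              # Step 6: fresh RFF weights (and frequencies)
    delta = y - f_tilde(X)                        # Steps 7-8: prior draw at tracks, innovation
    beta = solve_triangular(L.T,                  # Step 9: beta = C^{-1} delta, reusing L
                            solve_triangular(L, delta, lower=True), lower=False)
    samples[s] = f_tilde(X_star) + K_sX @ beta    # Steps 7 & 10: corrected grid sample
```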


What the Output Looks Like

After $S$ samples $\{\eta^*_s\}_{s=1}^S$:

Posterior mean estimate (best single map):

$$\hat{\eta} = \frac{1}{S}\sum_{s=1}^S \eta^*_s \approx \mu_{f|y} \qquad (n,)$$

For large $S$ this converges to the exact posterior mean $K_{\mathcal{X}^*\mathcal{X}}\alpha$ computed in Step 5 — but you may as well use Step 5 directly.

Pointwise posterior standard deviation (uncertainty map):

$$\hat{\sigma}(x_i^*) = \sqrt{\frac{1}{S-1}\sum_{s=1}^S \left(\eta^*_s(x_i^*) - \hat{\eta}(x_i^*)\right)^2} \qquad (n,)$$

Large $\hat{\sigma}$ where tracks are sparse; small $\hat{\sigma}$ under dense track coverage. This is physically interpretable — you know exactly where the reconstruction is trustworthy.

Physically consistent realisations: Each $\eta^*_s$ is a plausible SSH field — spatially smooth, consistent with the kernel's length scale, and passing through the observations within noise. These can be passed directly to downstream analyses (eddy detection, geostrophic current estimation) to propagate uncertainty.
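
For completeness, the ensemble summaries in numpy, assuming `samples` from the sampling loop above and the grid axes `grid_lat`, `grid_lon` used when building `X_star`:

```python
eta_hat = samples.mean(axis=0)                               # ensemble mean, approximates K_sX @ alpha
sigma_hat = samples.std(axis=0, ddof=1)                      # pointwise posterior standard deviation
eta_map = eta_hat.reshape(len(grid_lat), len(grid_lon))      # back to the 2D grid for mapping
sigma_map = sigma_hat.reshape(len(grid_lat), len(grid_lon))  # uncertainty map on the same grid
```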


Storage Analysis

All costs are at float64 (8 bytes). Let $n = 10^5$, $m = 5 \times 10^3$, $r = 2 \times 10^3$, $S = 100$.

| Object | Shape | Dtype | Storage formula | Size |
| --- | --- | --- | --- | --- |
| $\mathcal{X}$ | $(m, 3)$ | float64 | $24m$ bytes | $120\,\text{KB}$ |
| $y$ | $(m,)$ | float64 | $8m$ bytes | $40\,\text{KB}$ |
| $K_{\mathcal{X}\mathcal{X}}$ | $(m, m)$ | float64 | $8m^2$ bytes | $200\,\text{MB}$ |
| $C = K_{\mathcal{X}\mathcal{X}} + \sigma^2 I$ | $(m, m)$ | float64 | $8m^2$ bytes | $200\,\text{MB}$ — can overwrite $K_{\mathcal{X}\mathcal{X}}$ |
| $L$ (Cholesky factor) | $(m, m)$ | float64 | $4m^2$ bytes (triangular) | $100\,\text{MB}$ |
| $\alpha = C^{-1}y$ | $(m,)$ | float64 | $8m$ bytes | $40\,\text{KB}$ |
| $K_{\mathcal{X}^*\mathcal{X}}$ | $(n, m)$ | float64 | $8nm$ bytes | $4\,\text{GB}$ |
| RFF spectral params | $(r, 3)$ | float64 | $24r$ bytes | $48\,\text{KB}$ |
| $\tilde{f}(\mathcal{X})$ | $(m,)$ | float64 | $8m$ bytes | $40\,\text{KB}$ |
| $\tilde{f}(\mathcal{X}^*)$ | $(n,)$ | float64 | $8n$ bytes | $800\,\text{KB}$ |
| $\beta = C^{-1}\delta$ | $(m,)$ | float64 | $8m$ bytes | $40\,\text{KB}$ |
| $\eta^*_s$ (one sample) | $(n,)$ | float64 | $8n$ bytes | $800\,\text{KB}$ |
| $\{\eta^*_s\}_{s=1}^S$ (all samples) | $(S, n)$ | float64 | $8Sn$ bytes | $80\,\text{MB}$ |

Total storage: dominated by $K_{\mathcal{X}^*\mathcal{X}}$ at $\mathcal{O}(nm)$ — approximately 4 GB for these parameters.

$$\text{Storage} = \mathcal{O}(m^2 + nm)$$

The $m^2$ term ($\sim 400\,\text{MB}$ combined for $K_{\mathcal{X}\mathcal{X}}$ and $L$) is secondary to the $nm$ term ($4\,\text{GB}$ for $K_{\mathcal{X}^*\mathcal{X}}$). If memory is tight, $K_{\mathcal{X}^*\mathcal{X}}$ can be computed in row-blocks and the mat-vecs in Steps 5 and 10 done blockwise — reducing peak memory to $\mathcal{O}(m^2 + b \cdot m)$ where $b$ is the block size.
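
A sketch of that blockwise variant, using the hypothetical `kernel` from above; `alpha` and `beta` are the coefficient vectors from Steps 3 and 9, and the block size is arbitrary.

```python
def blockwise_matvecs(X_star, X, alpha, beta, block=20_000):
    """Compute K_sX @ alpha and K_sX @ beta without holding the full (n, m) matrix."""
    n = len(X_star)
    mean = np.empty(n)
    corr = np.empty(n)
    for i in range(0, n, block):
        K_block = kernel(X_star[i:i + block], X)   # (b, m) row block, built on the fly
        mean[i:i + block] = K_block @ alpha        # contribution to the posterior mean
        corr[i:i + block] = K_block @ beta         # contribution to the pathwise correction
    return mean, corr
```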


Compute Analysis

Let $d_{ker}$ be the cost of a single kernel evaluation (great-circle distance + Matérn + OU): $\mathcal{O}(1)$ but with a non-trivial constant (trig functions for the great-circle distance).

Offline (once per time window)

| Step | Operation | Cost | Notes |
| --- | --- | --- | --- |
| Build $K_{\mathcal{X}\mathcal{X}}$ | $m^2$ kernel evaluations | $\mathcal{O}(m^2 d_{ker})$ | Symmetric: $m(m+1)/2$ unique evals |
| Cholesky of $C$ | Dense factorisation | $\mathcal{O}(m^3 / 3)$ | Dominant offline cost |
| Solve $\alpha = C^{-1}y$ | Two triangular solves | $\mathcal{O}(m^2)$ | Negligible after Cholesky |
| Build $K_{\mathcal{X}^*\mathcal{X}}$ | $nm$ kernel evaluations | $\mathcal{O}(nm\, d_{ker})$ | Embarrassingly parallel |
| Compute $\mu_{f\mid y} = K_{\mathcal{X}^*\mathcal{X}}\alpha$ | Dense mat-vec | $\mathcal{O}(nm)$ | |

Total offline: $\mathcal{O}(m^3 + nm)$

For $m = 5\times10^3$: $m^3/3 \approx 4 \times 10^{10}$ flops — roughly 40 seconds at an effective rate of $10^9$ flops/sec (a conservative figure for a modern CPU). $nm = 5 \times 10^8$ — sub-second.

Per Posterior Sample (Steps 6–10)

| Step | Operation | Cost | Notes |
| --- | --- | --- | --- |
| Draw $w_j$ and spectral freqs | Sample $r$ Gaussians + $r$ uniforms | $\mathcal{O}(r)$ | Negligible |
| Evaluate $\tilde{f}(\mathcal{X})$ | Form $(m \times r)$ feature matrix, multiply by $w$ | $\mathcal{O}(mr)$ | |
| Evaluate $\tilde{f}(\mathcal{X}^*)$ | Form $(n \times r)$ feature matrix, multiply by $w$ | $\mathcal{O}(nr)$ | Dominant per-sample cost if $nr > nm$ |
| Compute $\delta = y - \tilde{f}(\mathcal{X})$ | Vector subtract | $\mathcal{O}(m)$ | Negligible |
| Solve $\beta = C^{-1}\delta$ | Two triangular solves | $\mathcal{O}(m^2)$ | Fast — $L$ precomputed |
| Correction $K_{\mathcal{X}^*\mathcal{X}}\beta$ | Dense mat-vec | $\mathcal{O}(nm)$ | BLAS-2, fast |
| Add: $\eta^* = \tilde{f}(\mathcal{X}^*) + K_{\mathcal{X}^*\mathcal{X}}\beta$ | Vector add | $\mathcal{O}(n)$ | Negligible |

Total per sample: $\mathcal{O}(r(m + n) + m^2 + nm) = \mathcal{O}(nm + rn)$

For $n = 10^5$, $m = 5 \times 10^3$, $r = 2 \times 10^3$: $nm = 5 \times 10^8$, $nr = 2 \times 10^8$, $m^2 = 2.5 \times 10^7$. The dominant per-sample cost is the mat-vec $K_{\mathcal{X}^*\mathcal{X}}\beta$ at $\mathcal{O}(nm)$.

For $S = 100$ samples: $100 \times (nm + rn) \approx 7 \times 10^{10}$ flops — about 70 seconds on CPU, or seconds on GPU (all mat-vecs are highly parallelisable).

Total end-to-end:

$$\text{Compute} = \underbrace{\mathcal{O}(m^3)}_{\text{Cholesky}} + \underbrace{\mathcal{O}(nm)}_{\text{cross-cov + mean}} + \underbrace{S \cdot \mathcal{O}(nm + rn)}_{\text{posterior samples}}$$

Complexity Summary

| Phase | Storage | Compute |
| --- | --- | --- |
| Build $K_{\mathcal{X}\mathcal{X}}$, Cholesky | $\mathcal{O}(m^2)$ | $\mathcal{O}(m^3)$ |
| Build $K_{\mathcal{X}^*\mathcal{X}}$, posterior mean | $\mathcal{O}(nm)$ — bottleneck | $\mathcal{O}(nm)$ |
| $S$ posterior samples | $\mathcal{O}(Sn)$ | $S \cdot \mathcal{O}(nm + rn)$ |
| Total | $\mathcal{O}(m^2 + nm)$ | $\mathcal{O}(m^3 + nm(1 + S) + Srn)$ |

Scaling regime: