# SSH Reconstruction via GP Pathwise Sampling
## The Inverse Problem
Goal: Reconstruct the SSH anomaly field η on a regular grid at a fixed target time $t^\star$, using satellite altimeter tracks from a temporal window around $t^\star$. This is a pure static inverse problem — no dynamics, no time-stepping. We treat the SSH field as approximately stationary over the window and let the kernel handle temporal discounting automatically.
Why pool observations in time? A single Jason-3 pass covers roughly 1–5% of the Mediterranean at any given time, with a 10-day repeat cycle. A single Sentinel-6 pass is similar. Pooling passes from nearby times dramatically increases spatial coverage. The cost is a temporal mismatch — observations at times $t \neq t^\star$ represent the field in slightly different states. Rather than manually inflating error variances (as 3D-Var does), the GP handles this gracefully: a spatiotemporal kernel naturally assigns lower covariance to observation pairs that are far apart in time, so temporally distant tracks are automatically downweighted during inference.
## Domain and State
| Symbol | Shape | Meaning |
|---|---|---|
| $N_{\text{lat}},\, N_{\text{lon}}$ | scalars | Number of grid cells in latitude and longitude |
| $N = N_{\text{lat}} N_{\text{lon}}$ | scalar | Total grid cells for the regular lat–lon grid over the Med |
| $\eta(t^\star)$ | $N$ | SSH anomaly field at time $t^\star$, vectorised row-major over the grid |
The field η is what we want to recover. It is unobserved everywhere except along sparse satellite swaths.
## Observations
Satellite altimeters sample SSH along 1D ground tracks — narrow strips that cross the Mediterranean at oblique angles. We pool three groups of passes:
| Symbol | Shape | Meaning |
|---|---|---|
| $y_1$ | $M_1$ | Along-track SSH anomaly measurements at time $t_1$ |
| $y_2$ | $M_2$ | Along-track SSH anomaly measurements at time $t_2$ |
| $y_3$ | $M_3$ | Along-track SSH anomaly measurements at time $t_3$ |
| $y = [y_1;\, y_2;\, y_3]$ | $M = M_1 + M_2 + M_3$ | Full concatenated observation vector |
Typical values for the Mediterranean: up to roughly 2000 observations per pass group, up to roughly 8000 in total. Each observation $y_j$ has an associated position $(\mathrm{lon}_j, \mathrm{lat}_j)$ and time $t_j$.
## The GP Prior
A Gaussian process is a distribution over functions. Writing $f \colon \mathcal{D} \to \mathbb{R}$, where $\mathcal{D}$ is the joint (space, time) domain:

$$
f \sim \mathcal{GP}(0, k)
$$

This means: for any finite collection of inputs $x_1, \dots, x_n \in \mathcal{D}$, the vector of function values $\big(f(x_1), \dots, f(x_n)\big)$ is jointly Gaussian with zero mean and covariance matrix whose $(i, j)$ entry is $k(x_i, x_j)$.
The kernel is the sole prior specification — it encodes everything we believe about the SSH field before seeing data: how smooth it is, how quickly correlations decay in space and time, and what its overall amplitude is.
Zero mean is appropriate here because we work with SSH anomalies (mean SSH removed), so the prior mean is zero by construction.
## Training and Prediction Inputs
The GP operates directly on (lon, lat, time) coordinates — no grid required at training time. Define the training inputs:

$$
\mathcal{X} \in \mathbb{R}^{M \times 3}, \qquad \mathcal{X}_{j,:} = (\mathrm{lon}_j,\ \mathrm{lat}_j,\ t_j)
$$

Each row is one satellite observation’s (lon, lat, time) coordinate. The prediction inputs are all grid cells at time $t^\star$:

$$
\mathcal{X}^* \in \mathbb{R}^{N \times 3}, \qquad \mathcal{X}^*_{i,:} = (\mathrm{lon}^*_i,\ \mathrm{lat}^*_i,\ t^\star)
$$

All prediction points share the same time $t^\star$ — we want the reconstructed field at the target time only.
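As a concrete sketch of how these input arrays might be assembled (illustrative NumPy only; the variable names, domain bounds, grid spacing and the synthetic track coordinates are stand-ins, not part of the method):

```python
import numpy as np

# Illustrative stand-ins for pooled along-track data: lon/lat in degrees, time in
# days relative to the target analysis time t_star (three pass groups pooled).
rng = np.random.default_rng(0)
M = 6000
lon_obs = rng.uniform(-5.0, 36.0, M)       # roughly Mediterranean longitudes (deg E)
lat_obs = rng.uniform(30.0, 46.0, M)       # roughly Mediterranean latitudes (deg N)
t_obs = rng.choice([-3.0, 0.0, 3.0], M)    # three pass-group times (days)
t_star = 0.0

# Training inputs: one row per observation, columns (lon, lat, time).
X_train = np.column_stack([lon_obs, lat_obs, t_obs])            # shape (M, 3)

# Prediction inputs: every grid cell, all at the target time t_star.
grid_lons = np.arange(-5.0, 36.0, 0.125)
grid_lats = np.arange(30.0, 46.0, 0.125)
lon_g, lat_g = np.meshgrid(grid_lons, grid_lats)
X_pred = np.column_stack([lon_g.ravel(), lat_g.ravel(),
                          np.full(lon_g.size, t_star)])         # shape (N, 3)
```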
## Kernel — Encoding Prior Physics
Use a separable spatiotemporal kernel — separable means the spatial and temporal contributions multiply:

$$
k\big((\mathbf{s}, t), (\mathbf{s}', t')\big) = \sigma_f^2 \; k_s(\mathbf{s}, \mathbf{s}') \; k_t(t, t')
$$
### Spatial Kernel — Matérn-3/2
The Matérn-3/2 kernel on the sphere (using great-circle distance $d = d_{\mathrm{gc}}(\mathbf{s}, \mathbf{s}')$):

$$
k_s(\mathbf{s}, \mathbf{s}') = \left(1 + \frac{\sqrt{3}\, d}{\ell_s}\right) \exp\!\left(-\frac{\sqrt{3}\, d}{\ell_s}\right)
$$
Why Matérn-3/2 and not Gaussian? The squared-exponential (Gaussian) kernel produces infinitely differentiable sample paths — unrealistically smooth for SSH. Matérn-3/2 produces once mean-square differentiable paths, which is physically appropriate for mesoscale SSH fields that have sharp fronts and eddy boundaries. Matérn-5/2 (twice differentiable) is also a reasonable choice.
Why great-circle distance? The Mediterranean spans enough longitude (~40°) that planar Euclidean distance is inaccurate. Great-circle distance respects the spherical geometry.
### Temporal Kernel — Ornstein-Uhlenbeck

$$
k_t(t, t') = \exp\!\left(-\frac{|t - t'|}{\ell_t}\right)
$$

The OU kernel (Matérn-1/2) produces exponentially decaying temporal correlations. The key physical consequence: observations at time $t_i$ contribute to the reconstruction at time $t^\star$ with a weight of $\exp(-|t_i - t^\star|/\ell_t)$ — automatically. If $\ell_t = 10$ days and $|t_i - t^\star| = 3$ days, this factor is $e^{-0.3} \approx 0.74$, meaning observations 3 days away carry about 74% of the weight of contemporaneous observations. No manual tuning of temporal weights is needed.
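A minimal sketch of this separable kernel (function and variable names are ours, not a fixed API; positions in degrees, the spatial length scale in km and times in days are assumed conventions):

```python
import numpy as np

R_EARTH_KM = 6371.0

def great_circle_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in degrees (haversine)."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2.0 * R_EARTH_KM * np.arcsin(np.sqrt(a))

def kernel(x1, x2, sigma_f2, ell_s_km, ell_t_days):
    """Separable spatiotemporal kernel: sigma_f^2 * Matern-3/2(space) * OU(time).

    x1: (m, 3) array with columns (lon, lat, time); x2: (n, 3). Returns (m, n)."""
    d = great_circle_km(x1[:, None, 0], x1[:, None, 1],
                        x2[None, :, 0], x2[None, :, 1])
    r = np.sqrt(3.0) * d / ell_s_km
    k_space = (1.0 + r) * np.exp(-r)                  # Matern-3/2 in great-circle distance
    dt = np.abs(x1[:, None, 2] - x2[None, :, 2])
    k_time = np.exp(-dt / ell_t_days)                 # Ornstein-Uhlenbeck (Matern-1/2)
    return sigma_f2 * k_space * k_time
```

With `X_train` and `X_pred` from the earlier sketch, `kernel(X_train, X_train, ...)` is the Gram matrix of the Key Matrices section below, and `kernel(X_pred, X_train, ...)` the cross-covariance matrix.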
### Hyperparameters
| Parameter | Meaning | Typical value (Med) |
|---|---|---|
| $\sigma_f^2$ | SSH anomaly marginal variance | |
| $\ell_s$ | Spatial decorrelation length | |
| $\ell_t$ | Temporal decorrelation timescale | ~10 days |
| $\sigma_n^2$ | Altimeter noise variance | |
## Key Matrices
These are the four objects that define all subsequent computation.
Gram matrix at training points:

$$
K_{\mathcal{X}\mathcal{X}} \in \mathbb{R}^{M \times M}, \qquad [K_{\mathcal{X}\mathcal{X}}]_{ij} = k(x_i, x_j)
$$

This is a dense symmetric PSD matrix. Entry $(i, j)$ is the prior covariance between observations $i$ and $j$ — it encodes how similar the SSH field is expected to be at those two (location, time) pairs. Observations close in space and time will have high covariance; observations far apart will have near-zero covariance. The diagonal is $\sigma_f^2$ (the full prior variance at each point).
Noisy Gram matrix:

$$
K_y = K_{\mathcal{X}\mathcal{X}} + \sigma_n^2 I
$$

Adding $\sigma_n^2 I$ accounts for altimeter instrument noise — it regularises the Gram matrix (ensuring it is strictly PD) and prevents overfitting to noisy observations. $K_y$ is the object we invert — it represents the total observation-space covariance including noise.
Cross-covariance matrix:

$$
K_{\mathcal{X}^*\mathcal{X}} \in \mathbb{R}^{N \times M}, \qquad [K_{\mathcal{X}^*\mathcal{X}}]_{ij} = k(x^*_i, x_j)
$$

Row $i$ of this matrix is the prior covariance between grid cell $i$ (at time $t^\star$) and each observation $j$ (at its (location, time)). This is the gain numerator — it determines how strongly each observation influences each grid cell. A grid cell far from all tracks will have small entries across its entire row, meaning observations barely update it. A grid cell directly under a track will have large entries for nearby observations.
Prior variance at prediction points (diagonal only):

$$
\operatorname{diag}\big(K_{\mathcal{X}^*\mathcal{X}^*}\big) = \sigma_f^2\, \mathbf{1}_N
$$

Since all prediction points share time $t^\star$ and the spatial kernel equals 1 at zero distance, the prior variance is $\sigma_f^2$ uniformly. The full $N \times N$ matrix $K_{\mathcal{X}^*\mathcal{X}^*}$ is never formed.
## Posterior Derivation
The GP prior and the observation model:

$$
f \sim \mathcal{GP}(0, k), \qquad y = f(\mathcal{X}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)
$$

together define a joint Gaussian over $\big(f(\mathcal{X}^*),\, y\big)$:

$$
\begin{bmatrix} f(\mathcal{X}^*) \\ y \end{bmatrix}
\sim \mathcal{N}\!\left(\mathbf{0},\;
\begin{bmatrix} K_{\mathcal{X}^*\mathcal{X}^*} & K_{\mathcal{X}^*\mathcal{X}} \\ K_{\mathcal{X}\mathcal{X}^*} & K_{\mathcal{X}\mathcal{X}} + \sigma_n^2 I \end{bmatrix}\right)
$$

Applying the Gaussian conditioning formula from the unifying framework with $\Sigma_{11} = K_{\mathcal{X}^*\mathcal{X}^*}$, $\Sigma_{12} = K_{\mathcal{X}^*\mathcal{X}}$ and $\Sigma_{22} = K_{\mathcal{X}\mathcal{X}} + \sigma_n^2 I = K_y$, the posterior is:

$$
f(\mathcal{X}^*) \mid y \sim \mathcal{N}\big(\mu_{f|y},\, \Sigma_{f|y}\big), \qquad
\mu_{f|y} = K_{\mathcal{X}^*\mathcal{X}} K_y^{-1} y, \qquad
\Sigma_{f|y} = K_{\mathcal{X}^*\mathcal{X}^*} - K_{\mathcal{X}^*\mathcal{X}} K_y^{-1} K_{\mathcal{X}\mathcal{X}^*}
$$
The posterior mean $\mu_{f|y}$ is the minimum mean-square error estimate of the SSH field — the best single map you can produce. The posterior covariance $\Sigma_{f|y}$ describes the joint uncertainty over all grid cells — but at $N \times N$ it is completely intractable to store or compute for grid-sized $N$.
Practical posterior quantities: We never compute $\Sigma_{f|y}$ directly. Instead:
- Pointwise variance (diagonal only): $[\Sigma_{f|y}]_{ii} = \sigma_f^2 - \mathbf{k}_i^\top K_y^{-1} \mathbf{k}_i$, where $\mathbf{k}_i^\top$ is row $i$ of $K_{\mathcal{X}^*\mathcal{X}}$ — computed row by row, cost $\mathcal{O}(NM^2)$ given a Cholesky factor of $K_y$ (see the sketch after this list).
- Posterior samples via the pathwise update — this is the subject of the next section.
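A sketch of the pointwise-variance computation, reusing the lower Cholesky factor of $K_y$ and processing the grid in row blocks (the helper name and block size are illustrative):

```python
import numpy as np
from scipy.linalg import solve_triangular

def pointwise_posterior_variance(K_star, L, sigma_f2, block=4096):
    """Diagonal of the posterior covariance, computed in row blocks of K_star (N, M).

    L is the lower Cholesky factor of K_y = K_XX + sigma_n^2 I."""
    N = K_star.shape[0]
    var = np.empty(N)
    for i0 in range(0, N, block):
        B = K_star[i0:i0 + block]                    # (b, M) block of cross-covariances
        V = solve_triangular(L, B.T, lower=True)     # L^{-1} B^T, columns are L^{-1} k_i
        var[i0:i0 + block] = sigma_f2 - np.sum(V * V, axis=0)
    return var
```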
## The Pathwise Update — Correcting Prior Samples
The standard posterior mean formula above requires $\alpha = K_y^{-1} y$ — a single linear solve. That gives you a point estimate. To get samples from the posterior, the naive approach would be to Cholesky-factorise the $N \times N$ posterior covariance $\Sigma_{f|y}$ — completely intractable.
Matheron’s rule (Wilson et al., 2020) provides an exact alternative. The key insight is that for Gaussian random variables, the conditional distribution can be written as a linear correction to a joint prior sample:

$$
\big(f(\mathcal{X}^*) \mid y\big) \;\stackrel{d}{=}\; f_{\text{prior}}(\mathcal{X}^*) + K_{\mathcal{X}^*\mathcal{X}} K_y^{-1} \big(y - f_{\text{prior}}(\mathcal{X}) - \varepsilon\big), \qquad \varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)
$$

where $f_{\text{prior}}$ is a joint prior sample drawn consistently from $\mathcal{GP}(0, k)$ at both the prediction and training inputs, and $\varepsilon$ is a fresh observation-noise draw. This equality is in distribution — the right-hand side is an exact draw from the posterior $\mathcal{N}(\mu_{f|y}, \Sigma_{f|y})$.
Why does this work? Because the correction term is the posterior mean formula applied not to $y$ but to the innovation $y - f_{\text{prior}}(\mathcal{X}) - \varepsilon$. Since all quantities are jointly Gaussian, the correction is linear, and the result has exactly the right mean and covariance.
Why is this better than sampling from $\mathcal{N}(\mu_{f|y}, \Sigma_{f|y})$ directly?
- Sampling from $\Sigma_{f|y}$ requires an $N \times N$ Cholesky: $\mathcal{O}(N^3)$ compute, $\mathcal{O}(N^2)$ storage — completely intractable for grid-sized $N$.
- The pathwise update only requires an $M \times M$ Cholesky (computed once) plus an $\mathcal{O}\big((N + M)F\big)$ prior sample evaluation and an $\mathcal{O}(NM)$ mat-vec. All per-sample operations scale linearly in $N$ and at most quadratically in $M$.
## Prior Sample via Random Fourier Features
To apply Matheron’s rule we need a joint prior sample $f_{\text{prior}}$ drawn consistently from $\mathcal{GP}(0, k)$.
Bochner’s theorem guarantees that a stationary kernel can be written as the Fourier transform of a spectral density $S(\omega)$:

$$
k(x - x') = \int S(\omega)\, e^{\,i \omega^\top (x - x')}\, \mathrm{d}\omega
$$
This motivates the Random Fourier Features (RFF) approximation. Sample $F$ frequency vectors $\omega_1, \dots, \omega_F$ from the normalised spectral density $p(\omega) \propto S(\omega)$ and draw uniform phase offsets $b_1, \dots, b_F \sim \mathcal{U}(0, 2\pi)$. Define the feature map:

$$
\phi(x) = \sqrt{\tfrac{2\sigma_f^2}{F}}\,\big[\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_F^\top x + b_F)\big]^\top \in \mathbb{R}^F
$$
Then the approximate prior sample is:

$$
f_{\text{prior}}(x) \approx \phi(x)^\top w, \qquad w \sim \mathcal{N}(0, I_F)
$$

where the spatial and temporal components of each frequency $\omega_j = \big(\omega_j^{(s)}, \omega_j^{(t)}\big)$ are drawn from the spectral densities of the spatial (Matérn-3/2) and temporal (OU) kernels respectively (drawn once, exploiting the separable structure). The weights $w$ are fixed for a given sample draw — evaluating $f_{\text{prior}}$ at any point is then a deterministic $\mathcal{O}(F)$ computation.
Consistency: because $(w, \{\omega_j\}, \{b_j\})$ are fixed, evaluating $f_{\text{prior}}$ at $\mathcal{X}$ and at $\mathcal{X}^*$ uses the same weights — this is what makes the prior sample joint and consistent, which is required for Matheron’s rule to produce exact posterior draws.
How many features $F$? The RFF approximation introduces an error of order $\mathcal{O}(1/\sqrt{F})$ in the kernel approximation. For SSH reconstruction, $F$ in the low thousands (up to roughly 4000) is typically sufficient. The approximation error in the posterior sample decreases as $F$ increases; the compute cost grows as $\mathcal{O}\big((N + M)F\big)$ per sample. A sketch of a sampler along these lines follows.
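The sketch below assumes spatial inputs have been projected to local planar km coordinates, a common approximation when pairing RFF with a great-circle Matérn kernel; all names are illustrative. Matérn-3/2 frequencies are multivariate-t distributed with 3 degrees of freedom, OU frequencies are Cauchy distributed.

```python
import numpy as np

def draw_rff_prior(F, sigma_f2, ell_s_km, ell_t_days, rng):
    """Draw RFF frequencies, phases and weights for one prior sample of the
    separable Matern-3/2 (space, planar-km approximation) x OU (time) kernel.

    Returns a function f_prior(X) evaluating the prior sample at inputs X with
    columns (x_km, y_km, t_days)."""
    # Spatial frequencies: multivariate t with 3 dof (Matern-3/2 spectral density).
    g = rng.standard_normal((F, 2))
    u = rng.chisquare(df=3, size=(F, 1))
    omega_s = g * np.sqrt(3.0 / u) / ell_s_km          # (F, 2)
    # Temporal frequencies: standard Cauchy scaled by 1/ell_t (OU spectral density).
    omega_t = rng.standard_cauchy(F) / ell_t_days      # (F,)
    omega = np.column_stack([omega_s, omega_t])        # (F, 3) joint frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, F)               # phases
    w = rng.standard_normal(F)                         # sample-specific weights

    def f_prior(X):
        phi = np.sqrt(2.0 * sigma_f2 / F) * np.cos(X @ omega.T + b)   # (n, F) features
        return phi @ w

    return f_prior
```

In this sketch the frequencies and phases are redrawn on every call; they could equally be drawn once and shared across samples, with only `w` redrawn, as noted in Step 6 below.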
## Full Algorithm
### Offline (once per time window)
Step 1 — Build the Gram matrix:

$$
[K_{\mathcal{X}\mathcal{X}}]_{ij} = k(x_i, x_j), \qquad i, j = 1, \dots, M
$$

Evaluate $M^2$ kernel calls ($M(M+1)/2$ unique, by symmetry). Each call computes a great-circle distance and evaluates Matérn-3/2 × OU — $\mathcal{O}(1)$ per pair.
Step 2 — Form and Cholesky-factorise $K_y$:

$$
K_y = K_{\mathcal{X}\mathcal{X}} + \sigma_n^2 I = L L^\top
$$

This is the dominant offline computation, $\mathcal{O}(M^3)$. $L$ is stored and reused for all subsequent solves and for every posterior sample.
Step 3 — Compute the posterior mean coefficient vector:

$$
\alpha = K_y^{-1} y \quad \text{via} \quad L L^\top \alpha = y
$$

Two triangular solves. α is the vector of dual weights — entry $\alpha_j$ is the weight assigned to observation $j$ in the posterior mean. Observations with large $|\alpha_j|$ are highly informative; those in dense clusters will have smaller $|\alpha_j|$ due to redundancy.
Step 4 — Build the cross-covariance matrix:

$$
[K_{\mathcal{X}^*\mathcal{X}}]_{ij} = k(x^*_i, x_j), \qquad i = 1, \dots, N, \; j = 1, \dots, M
$$

Evaluate $NM$ kernel calls. Since all prediction points share time $t^\star$, the temporal factor takes only three distinct values (one per time group $g \in \{1, 2, 3\}$) — this allows factoring:

$$
K_{\mathcal{X}^*\mathcal{X}} = \Big[\; e^{-|t_1 - t^\star|/\ell_t}\, K_s^{(1)} \;\Big|\; e^{-|t_2 - t^\star|/\ell_t}\, K_s^{(2)} \;\Big|\; e^{-|t_3 - t^\star|/\ell_t}\, K_s^{(3)} \;\Big]
$$

where $K_s^{(g)}$ is the purely spatial cross-covariance matrix between all grid points and the $g$-th group’s track locations. This factorisation reduces the work to $NM$ purely spatial kernel evaluations plus three cheap scalar exponentials.
Step 5 — Compute the posterior mean field:

$$
\mu_{f|y} = K_{\mathcal{X}^*\mathcal{X}}\, \alpha
$$

A dense mat-vec: an $N \times M$ matrix times an $M$-vector, $\mathcal{O}(NM)$. This gives the interpolated SSH field on the grid at time $t^\star$.
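Steps 1–5 as a NumPy/SciPy sketch, reusing the `kernel` function from the earlier sketch (the hyperparameters `sigma_f2`, `ell_s_km`, `ell_t_days`, `sigma_n2` and the arrays `X_train`, `X_pred`, `y` are assumed to be defined):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Offline stage (Steps 1-5).
K_xx = kernel(X_train, X_train, sigma_f2, ell_s_km, ell_t_days)    # Step 1: Gram matrix (M, M)
K_y = K_xx + sigma_n2 * np.eye(len(y))                             # Step 2: add noise variance
cho = cho_factor(K_y, lower=True)                                  #         Cholesky, O(M^3)
alpha = cho_solve(cho, y)                                          # Step 3: dual weights (M,)
K_star = kernel(X_pred, X_train, sigma_f2, ell_s_km, ell_t_days)   # Step 4: cross-covariance (N, M)
mu_post = K_star @ alpha                                           # Step 5: posterior mean field (N,)
```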
### Per Posterior Sample (repeat $S$ times)
Step 6 — Draw RFF weights:

$$
w \sim \mathcal{N}(0, I_F)
$$

Sample $F$ spatial frequencies from the Matérn-3/2 spectral density and $F$ temporal frequencies from the OU spectral density (both done once, reused across samples if desired, or redrawn for fresh samples).
Step 7 — Evaluate the prior sample at training and prediction inputs:

$$
f_{\text{prior}}(\mathcal{X}) = \Phi_{\mathcal{X}}\, w \in \mathbb{R}^M, \qquad f_{\text{prior}}(\mathcal{X}^*) = \Phi_{\mathcal{X}^*}\, w \in \mathbb{R}^N
$$

Each evaluation: form the feature matrix Φ, then multiply by $w$. Cost $\mathcal{O}\big((N + M)F\big)$.
Step 8 — Compute the innovation:

$$
r = y - f_{\text{prior}}(\mathcal{X}) - \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)
$$

This is the residual between what the (noise-perturbed) prior sample predicts at the track locations and what was actually observed. If the prior sample happened to reproduce the data exactly, $r = 0$ and no correction is needed — the prior sample is already a valid posterior draw (this never happens in practice).
Step 9 — Solve for the correction coefficient vector:

$$
\beta = K_y^{-1} r \quad \text{via} \quad L L^\top \beta = r
$$

Two triangular solves using the already-computed $L$. This is $\mathcal{O}(M^2)$ — cheap because $L$ is already available.
Step 10 — Apply the pathwise correction:

$$
f^{(s)} = f_{\text{prior}}(\mathcal{X}^*) + K_{\mathcal{X}^*\mathcal{X}}\, \beta
$$

One dense mat-vec: an $N \times M$ matrix times an $M$-vector. The result is one exact posterior sample of the SSH field on the grid at time $t^\star$.
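Steps 6–10 as a per-sample loop, reusing `draw_rff_prior`, `cho`, `K_star` and `mu_post` from the sketches above (`S`, `F`, `sigma_n2` are assumed given; `X_train_km` and `X_pred_km` are hypothetical copies of the inputs with positions projected to local planar km coordinates for the RFF approximation):

```python
import numpy as np
from scipy.linalg import cho_solve

rng = np.random.default_rng(0)
samples = np.empty((S, K_star.shape[0]))
for s in range(S):
    f_prior = draw_rff_prior(F, sigma_f2, ell_s_km, ell_t_days, rng)   # Steps 6-7
    eps = np.sqrt(sigma_n2) * rng.standard_normal(len(y))              # observation-noise draw
    r = y - f_prior(X_train_km) - eps                                  # Step 8: innovation
    beta = cho_solve(cho, r)                                           # Step 9: correction coefficients
    samples[s] = f_prior(X_pred_km) + K_star @ beta                    # Step 10: pathwise correction
```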
## What the Output Looks Like
After $S$ samples $f^{(1)}, \dots, f^{(S)}$:

Posterior mean estimate (best single map):

$$
\bar{f} = \frac{1}{S} \sum_{s=1}^{S} f^{(s)}
$$
For large $S$ this converges to the exact posterior mean computed in Step 5 — but you may as well use Step 5 directly.
Pointwise posterior standard deviation (uncertainty map):

$$
\hat{\sigma}_i = \sqrt{\frac{1}{S - 1} \sum_{s=1}^{S} \big(f_i^{(s)} - \bar{f}_i\big)^2}
$$

Large $\hat{\sigma}_i$ where tracks are sparse; small under dense track coverage. This is physically interpretable — you know exactly where the reconstruction is trustworthy.
Physically consistent realisations: Each $f^{(s)}$ is a plausible SSH field — spatially smooth, consistent with the kernel’s length scale, and passing through the observations within noise. These can be passed directly to downstream analyses (eddy detection, geostrophic current estimation) to propagate uncertainty.
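These summaries follow directly from the stacked `samples` array of the previous sketch (array names are the same illustrative ones):

```python
import numpy as np

f_mean = samples.mean(axis=0)           # Monte Carlo posterior mean (converges to mu_post)
f_std = samples.std(axis=0, ddof=1)     # pointwise posterior standard deviation map
# Each samples[s] can be reshaped to the (N_lat, N_lon) grid and passed to
# downstream diagnostics (eddy detection, geostrophic currents, ...).
```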
## Storage Analysis
All costs assume float64 (8 bytes per entry). Let $M$ be the total number of pooled observations, $N$ the number of grid cells, $F$ the number of RFF features, and $S$ the number of posterior samples.
| Object | Shape | Dtype | Storage formula |
|---|---|---|---|
| $\mathcal{X}$ (training inputs) | $M \times 3$ | float64 | $24M$ bytes |
| $y$ | $M$ | float64 | $8M$ bytes |
| $K_{\mathcal{X}\mathcal{X}}$ | $M \times M$ | float64 | $8M^2$ bytes |
| $K_y$ | $M \times M$ | float64 | $8M^2$ bytes — can overwrite $K_{\mathcal{X}\mathcal{X}}$ |
| $L$ (Cholesky factor) | $M \times M$ | float64 | $4M^2$ bytes (triangular) |
| $\alpha$ | $M$ | float64 | $8M$ bytes |
| $K_{\mathcal{X}^*\mathcal{X}}$ | $N \times M$ | float64 | $8NM$ bytes |
| RFF spectral params ($\omega$, $b$) | $F \times 4$ | float64 | $32F$ bytes |
| $w$ | $F$ | float64 | $8F$ bytes |
| $f_{\text{prior}}(\mathcal{X})$ | $M$ | float64 | $8M$ bytes |
| $f_{\text{prior}}(\mathcal{X}^*)$ | $N$ | float64 | $8N$ bytes |
| $f^{(s)}$ (one sample) | $N$ | float64 | $8N$ bytes |
| All samples | $S \times N$ | float64 | $8SN$ bytes |
Total storage: dominated by $K_{\mathcal{X}^*\mathcal{X}}$ at $8NM$ bytes — approximately 4 GB at the Mediterranean problem sizes considered here.
The $\mathcal{O}(M^2)$ term (for $K_y$ and $L$ combined) is secondary to the $\mathcal{O}(NM)$ term ($8NM$ bytes for $K_{\mathcal{X}^*\mathcal{X}}$). If memory is tight, $K_{\mathcal{X}^*\mathcal{X}}$ can be computed in row-blocks and the mat-vecs in Steps 5 and 10 done blockwise — reducing peak memory to $\mathcal{O}(BM)$ where $B$ is the block size.
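A sketch of the blockwise variant (the helper name and block size are illustrative):

```python
import numpy as np

def blockwise_cross_matvec(X_pred, X_train, vec, kernel_fn, block=8192):
    """Compute K_star @ vec without materialising the full (N, M) cross-covariance.

    kernel_fn(A, B) returns the (len(A), len(B)) covariance block, e.g.
    kernel_fn = lambda A, B: kernel(A, B, sigma_f2, ell_s_km, ell_t_days)."""
    N = X_pred.shape[0]
    out = np.empty(N)
    for i0 in range(0, N, block):
        K_block = kernel_fn(X_pred[i0:i0 + block], X_train)   # (b, M), built on the fly
        out[i0:i0 + block] = K_block @ vec
    return out
```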
## Compute Analysis
Let $c_k$ be the cost of a single kernel evaluation (great-circle distance + Matérn + OU): $\mathcal{O}(1)$, but with a non-trivial constant (trig functions for the great-circle distance).
### Offline (once per time window)
| Step | Operation | Cost | Notes |
|---|---|---|---|
| 1 | Build $K_{\mathcal{X}\mathcal{X}}$ | $\mathcal{O}(c_k M^2)$ kernel evaluations | Symmetric: $M(M+1)/2$ unique evals |
| 2 | Cholesky of $K_y$ | $\mathcal{O}(M^3)$ | Dense factorisation — dominant offline cost |
| 3 | Solve $L L^\top \alpha = y$ | $\mathcal{O}(M^2)$ | Two triangular solves — negligible after Cholesky |
| 4 | Build $K_{\mathcal{X}^*\mathcal{X}}$ | $\mathcal{O}(c_k NM)$ kernel evaluations | Embarrassingly parallel |
| 5 | Compute $\mu_{f\mid y} = K_{\mathcal{X}^*\mathcal{X}}\alpha$ | $\mathcal{O}(NM)$ | Dense mat-vec |
Total offline: $\mathcal{O}\big(c_k M^2 + M^3 + c_k NM\big)$.

For $M$ in the thousands (say $M \approx 8000$), the $M^3$ Cholesky is of order $10^{11}$ flops. At $10^{12}$ flops/sec (modern CPU) this takes a fraction of a second — sub-second.
### Per Posterior Sample (Steps 6–10)
| Step | Operation | Cost | Notes |
|---|---|---|---|
| 6 | Draw $w$ and spectral freqs | $\mathcal{O}(F)$ | Sample Gaussians + uniforms — negligible |
| 7 | Evaluate $f_{\text{prior}}(\mathcal{X})$ | $\mathcal{O}(MF)$ | Form feature matrix, multiply by $w$ |
| 7 | Evaluate $f_{\text{prior}}(\mathcal{X}^*)$ | $\mathcal{O}(NF)$ | Form feature matrix, multiply by $w$ — dominant per-sample cost if $F > M$ |
| 8 | Compute $r = y - f_{\text{prior}}(\mathcal{X}) - \varepsilon$ | $\mathcal{O}(M)$ | Vector subtract — negligible |
| 9 | Solve $L L^\top \beta = r$ | $\mathcal{O}(M^2)$ | Two triangular solves — fast, $L$ precomputed |
| 10 | Correction $K_{\mathcal{X}^*\mathcal{X}}\beta$ | $\mathcal{O}(NM)$ | Dense mat-vec — BLAS-2, fast |
| 10 | Add $f_{\text{prior}}(\mathcal{X}^*) +$ correction | $\mathcal{O}(N)$ | Vector add — negligible |
Total per sample: $\mathcal{O}\big((N + M)F + M^2 + NM\big)$.

With $F$ smaller than $M$ (the typical regime here), the dominant per-sample cost is the $\mathcal{O}(NM)$ mat-vec in Step 10, followed by the $\mathcal{O}(NF)$ RFF evaluation.

For $S$ samples the sampling cost scales linearly in $S$ — about 70 seconds on CPU for a representative run, or of the order of seconds on GPU (all mat-vecs are highly parallelisable).
Total end-to-end: $\mathcal{O}\big(c_k M^2 + M^3 + c_k NM + S\,[(N + M)F + M^2 + NM]\big)$ — one offline Cholesky plus $S$ cheap pathwise corrections.
## Complexity Summary
| Phase | Storage | Compute |
|---|---|---|
| Build $K_{\mathcal{X}\mathcal{X}}$, Cholesky | $\mathcal{O}(M^2)$ | $\mathcal{O}(M^3)$ |
| Build $K_{\mathcal{X}^*\mathcal{X}}$, posterior mean | $\mathcal{O}(NM)$ — storage bottleneck | $\mathcal{O}(NM)$ |
| $S$ posterior samples | $\mathcal{O}(SN + NF)$ | $\mathcal{O}\big(S\,[(N + M)F + M^2 + NM]\big)$ |
| Total | $\mathcal{O}(M^2 + NM + SN)$ | $\mathcal{O}\big(M^3 + S\,[(N + M)F + NM]\big)$ |
Scaling regime:
- When $M$ is small (up to roughly $10^4$): the Cholesky is fast; kernel evaluations and the $\mathcal{O}(NM)$ mat-vecs dominate.
- When $M$ is large (order $10^5$ or more): the $\mathcal{O}(M^3)$ Cholesky becomes intractable — a sparse GP / inducing-point approximation is needed to bring the effective $M$ back down.
- When $S$ is large: the per-sample $\mathcal{O}(NM)$ mat-vec dominates; batch across samples into a single $N \times S$ output matrix so the correction becomes one BLAS-3 mat-mat (see the sketch below).
- When $N$ is large (a much finer grid): the $\mathcal{O}(NM)$ storage of $K_{\mathcal{X}^*\mathcal{X}}$ becomes intractable — compute it in blocks of rows, never materialise the full matrix.
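A sketch of the sample-batching idea from the third bullet above, reusing `draw_rff_prior`, `cho`, `K_star`, `X_train_km`, `X_pred_km` and `y` from the earlier sketches (the stacked innovation and prior matrices are illustrative names, not fixed API):

```python
import numpy as np
from scipy.linalg import cho_solve

# Build all S innovations and prior evaluations first, then do the expensive
# correction as one (N x M)(M x S) mat-mat (BLAS-3) instead of S separate mat-vecs.
rng = np.random.default_rng(1)
priors = [draw_rff_prior(F, sigma_f2, ell_s_km, ell_t_days, rng) for _ in range(S)]
R = np.column_stack([y - fp(X_train_km) - np.sqrt(sigma_n2) * rng.standard_normal(len(y))
                     for fp in priors])                              # (M, S) innovations
F_prior_pred = np.column_stack([fp(X_pred_km) for fp in priors])     # (N, S) prior samples on grid
Beta = cho_solve(cho, R)                                             # (M, S): one solve per column
samples = (F_prior_pred + K_star @ Beta).T                           # (S, N) posterior samples
```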