- Data Size - Small, Medium, Large
- Linear -> NonLinear
- Deterministic, Probabilistic, Bayesian
0.0 - Datasets¶
- Fire
- Drought
- Agro
- Temperature, Precipitation
1.0 - Learning with Observation Data¶
Data:¶
- L2 Gappy Observations
- L3 Gap-Filled Observations
Cases:¶
- Data - L2 Obs
- Data - L2 Obs and L3 Interpolated Obs
Formulation:¶
1.1 - Discretization¶
A method that discretizes the unstructured data into a structured representation, e.g., a Cartesian, rectilinear or curvilinear grid.
Use Case
- Data 4 Learning -> Parameters, Interpolator
- Data 4 Estimation -> State, Latent State
1.1.1 - Histogram¶
- a - Equidistant Binning 4 Cartesian Grids - Global + Masks + Weights - Boost-Histogram | xarray-histogram | dask-histogram | xarray | xcdat
- b - Adaptive Binning 4 Rectilinear Grids - KBinsDiscretizer - sklearn tutorial | tutorial
- c - Graph-Node Binning:
- Voronoi - Voronoi w/ Python | Object Seg. w/ Voronoi | Semi-Discrete Flows
- K-Means - Ocean Clustering Example | Region Joining
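As a minimal sketch of option (a), equidistant binning aggregates scattered observations onto a Cartesian grid with a bin-mean. Everything here (function name, toy data, grid edges) is illustrative, numpy-only, not the Boost-Histogram/xarray API:

```python
import numpy as np

def bin_observations(lon, lat, values, lon_edges, lat_edges):
    """Aggregate scattered observations onto a Cartesian grid by bin-mean."""
    # Per-cell sum of observation values
    sums, _, _ = np.histogram2d(lon, lat, bins=[lon_edges, lat_edges], weights=values)
    # Per-cell observation counts
    counts, _, _ = np.histogram2d(lon, lat, bins=[lon_edges, lat_edges])
    with np.errstate(invalid="ignore"):
        return sums / counts  # NaN where a cell holds no observations

rng = np.random.default_rng(0)
lon = rng.uniform(0, 10, 500)
lat = rng.uniform(0, 5, 500)
vals = np.sin(lon) + 0.1 * rng.normal(size=500)
grid = bin_observations(lon, lat, vals, np.linspace(0, 10, 21), np.linspace(0, 5, 11))
```

Empty cells stay NaN, which is exactly the gappy structured representation the later interpolators consume.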
Supp. Material¶
1.2 - Non-Parametric Interpolator (Coordinate-Based)¶
A method that applies a non-parametric, coordinate-based regression algorithm to interpolate the observations based on SpatioTemporal location.
Use Case
- Learning - Interpolated Maps
- Estimation - Initial Conditions & Boundary Conditions 4 Data Assimilation
1.2.1 - Naive Methods¶
We will revisit the same methods used for the Discretization. This will include the kernel density method and the k nearest neighbors method.
a - PyInterp Baselines - Linear, IDW, RBF, Window Function, Kriging/OI/GPs, Splines
1.2.1a - Kernel Density Estimation¶
In this section, we look at kernel density estimation as a non-parametric methodology to gap-fill unstructured observations. We will start with the most basic brute-force method over the k nearest neighbors. Then we will look at scalable alternatives like KD-Trees, Ball-Trees, or the FFT. We'll also look at ways to scale via hardware with libraries like KeOps or cuML, both of which use advanced methods to take advantage of GPUs.
Basic Methods:
- Naive, Brute Force - sklearn tutorial | sklearn.neighbors.KernelDensity
Scaling
- Algorithm:
- Tree-Based - jakevdp tutorial | sklearn.neighbors.KernelDensity | numba-neighbors
- Advanced Approximate NN - sklearn.ann | PyNNDescent
- FFT (Equidistant) - KDEPy - kdepy
- Data Structure
- Sparse - sklearn.neighbors
- Hardware:
- cuML - KDE cuml
- KeOps - KNN Example
Applied Problems:
- KDE Regression - kdepy example | wiki | Derivation | Video Derivation | Error Analysis | pytorch example
- Connection to Attention - d2l.ai | blog
- KDE Examples with Viz - Visualizing GeoData | Point Pattern Analysis
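The KDE-regression idea above (and its connection to attention) reduces to a Nadaraya-Watson weighted average: a Gaussian kernel over distances produces normalized weights, which are applied to the training targets. A minimal numpy sketch with illustrative names and a toy 1D dataset:

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=1.0):
    """Gaussian-kernel weighted average of training targets (KDE regression)."""
    # Pairwise squared distances between query and training coordinates
    d2 = (x_query[:, None] - x_train[None, :]) ** 2
    w = np.exp(-0.5 * d2 / bandwidth**2)   # Gaussian kernel weights
    w /= w.sum(axis=1, keepdims=True)      # normalize -> attention-like weights
    return w @ y_train

x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x)
xq = np.array([np.pi / 2])
yq = nadaraya_watson(x, y, xq, bandwidth=0.3)
```

The normalized weight matrix is exactly the "attention" map discussed in the d2l.ai link: queries attend to training locations.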
1.2.1b - KNN Interpolation¶
Here, we use k-nearest neighbors (KNN) to do interpolation. This is one of the simplest, most versatile algorithms available for learning, and it scales well because it only uses the nearest neighbors to interpolate gappy data. We also showcase how we can modify the distance weighting with inverse-distance weights or a custom distance function, e.g., a Gaussian kernel.
Basic Methods:
- Probabilistic Interpretation - Course
- Naive, Brute-Force, Parallel - sklearn.neighbors.KNeighborsRegressor | sklearn.neighbors.RadiusNeighborsRegressor | From Scratch
- Distance - Uniform, IDW, Gaussian - example.ipynb
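A minimal numpy sketch of KNN interpolation with inverse-distance weighting; the names and toy linear field are illustrative, not the sklearn API:

```python
import numpy as np

def knn_idw(coords, values, query, k=5, eps=1e-12):
    """Interpolate at a query point with k nearest neighbors + IDW weights."""
    d = np.linalg.norm(coords - query, axis=1)   # distances to all training points
    idx = np.argsort(d)[:k]                      # indices of the k nearest neighbors
    w = 1.0 / (d[idx] + eps)                     # inverse-distance weights
    return np.sum(w * values[idx]) / np.sum(w)

rng = np.random.default_rng(1)
coords = rng.uniform(0, 1, size=(200, 2))
values = coords[:, 0] + coords[:, 1]             # a simple linear field
pred = knn_idw(coords, values, np.array([0.5, 0.5]))
```

Swapping the IDW weights for a Gaussian kernel of the distances gives the custom-distance variant mentioned above.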
Scaling:
- Algorithm:
- Hardware:
- cuML + Dask - Demo Blog | cuml.neighbors.KNeighborsRegressor
Example Applications:
- Housing Interpolation w/ KNN + IDW - Medium
Strengths: K-nearest neighbors regression
- is a simple, intuitive algorithm,
- requires few assumptions about what the data must look like, and
- works well with non-linear relationships (i.e., if the relationship is not a straight line).
- has quick computation time, easy interpretability, versatility across classification and regression problems, and a non-parametric nature (no distributional assumptions or data tuning required).
Weaknesses: K-nearest neighbors regression
- becomes very slow as the training data gets larger,
- may not perform well with a large number of predictors, and
- may not predict well beyond the range of values input in your training data.
- For every new test point, KNN must compute the distance to all of the training points, which becomes expensive for large datasets with many features. Extensions like KD-Trees address this.
- KNN is also sensitive to irrelevant features, but this can be addressed by feature selection; one option is to perform PCA on the data and keep only the principal components for the KNN analysis.
- KNN needs to store all of the training data, which can be quite costly for large datasets.
1.2.2 - GPs/OI/Kriging¶
This will feature tutorials to build up our GP/OI/Kriging mathematical proficiency. We will start from scratch and work our way up to full libraries. We will also look at some specific terminology, e.g., length scale vs. lag.
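To make the OI/GP machinery concrete, here is the posterior (interpolation) mean from scratch in plain numpy; the names, toy 1D data, and hyperparameters are illustrative:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two sets of 1D coordinates."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior_mean(x_obs, y_obs, x_query, noise=1e-2, length_scale=1.0):
    """OI / GP-regression posterior mean: K_* (K + sigma^2 I)^{-1} y."""
    K = rbf_kernel(x_obs, x_obs, length_scale) + noise * np.eye(len(x_obs))
    Ks = rbf_kernel(x_query, x_obs, length_scale)
    return Ks @ np.linalg.solve(K, y_obs)

x = np.linspace(0, 2 * np.pi, 30)
y = np.sin(x)
mu = gp_posterior_mean(x, y, np.array([np.pi / 2]), length_scale=1.0)
```

The same formula is what OI/Kriging computes; only the naming of the covariance (kernel vs. covariogram, length scale vs. lag) differs between communities.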
Applications
- Data Assimilation - DA Window + LOWESS
We will use the LOWESS method to do interpolation on a subset of spatiotemporal data. We will look at 3 data types:
- Sea surface height - very sparse, structured randomness
- Sea surface temperature - dense, structured randomness
- Land temperature data
Software
- Optimal Interpolation 4 Data Assimilation (OI4DA) - package + xarray interface + sklearn column transforms
From Scratch
- a - GP From Scratch - JAX + Cola - Demo NB
- b - GP w/ Libs - JAX + TinyGP + Bayesian Inference (Demo NBs)
- c - GP w/ PPLs - JAX + Cola + Numpyro
- d - Customizing GP w/ PPLs - Custom TFP Distribution | Custom Numpyro Distribution
Canonical Example
Scaling
- d - Kernel Matrix Approximations - sklearn.kernel_approximation | My kernellib
- e - Hardware - KeOps | KeOps + GPyTorch
Appendix
- jax + kernel functions + jax.vmap
- Distances - scipy overview | jax demo
- Kernel Matrices - jax demo
- Kernel Matrix Derivatives - jax demo
1.2.3 - Improved GPs - Moment-Based¶
- a - Sparse GPs w/ PPLs - My Jax Code + Bayesian Inference
- b - SVGPs w/ PPLs - GPJax | Pyro-PPL | GPyTorch
- c - Structured GPs - SKI/SKIP (Previous Work, Example)
- d - Deep Kernel Learning - DUE | GPyTorch | Pyro-PPL
1.2.4 - Improved GPs - Basis Functions¶
- Fourier Features GP - RFF | PyRFF | GPyTorch
- Spherical Harmonics GPs (SHGPs) - GPfY | SphericalHarmonics | Torch-Harmonics | LocationEncoder | kNerF List
- Sparse SHGPs - GPfY
1.2.5 - State Space Gaussian Processes¶
In this improvement, we add the Markovian assumption which improves the scalability. See this video for a better introduction.
- Markovian GPs (MGPs) - BayesNewton | MarkovFlow | Dynamax
- Sparse MGPs
1.3 - Parametric Interpolator (Coordinate-Based)¶
Learns a parametric, coordinate-based, Differentiable Interpolator for fast queries and online training.
Use Case¶
- Learning - Compressed Representation, Online Learning
- Estimate - Fast Queries, Online Estimation
Formulation¶
Algorithms¶
Baseline - SIREN
Improvements - SpatioTemporal Encoders
Research - Physics Informed, Modulated, Scalable, Stochastic
a - SIREN
b - spatial coordinate encoders
c - temporal coordinate encoders
d - modulation
Scale
- Hashing
Background - TimeEmbedding, SpatialEmbeddings
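A minimal numpy sketch of the SIREN baseline (a): a coordinate MLP with sinusoidal activations mapping (t, lat, lon) to a field value. The initialization and uniform w0 handling here are simplified, illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def siren_forward(coords, weights, biases, w0=30.0):
    """Forward pass of a SIREN-style MLP: sine activations on hidden layers."""
    h = coords
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.sin(w0 * (h @ W + b))     # sinusoidal activation
    return h @ weights[-1] + biases[-1]  # linear output layer

rng = np.random.default_rng(0)
dims = [3, 64, 64, 1]  # (t, lat, lon) coordinates -> scalar field value
weights = [rng.uniform(-1, 1, (a, b)) / a for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]
out = siren_forward(rng.uniform(-1, 1, (10, 3)), weights, biases)
```

Because the interpolator is just this differentiable function of coordinates, queries at arbitrary SpatioTemporal locations are a single forward pass.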
1.4 - Parametric SpatioTemporal Field Interpolator (Field-Based)¶
These methods are parametric interpolators: they operate directly on the gappy fields and output a gap-free field. Being parametric implies that they use neural networks to some degree. Because the data span space and time, we will need physics-inspired architectures that decompose the field into a spatial operator and a TimeStepper. For the spatial operator, we can use architectures like convolutions, transformers, or graphs. For the TimeStepper, we can use convolutions, recurrent neural networks, transformers, or graphs.
Use Cases:
- Learning - Fast, Compressed Interpolator, ROM, PnP Priors, Anomaly Detectors, Pretraining 4 DA
- Estimation - Latent Variable Data Assimilation
Algorithms
- Baseline: (Spectral) Conv, UNet, DINEOF, Convolutional Neural Operator
- Improved: Deep Equilibrium Models
- Research: Transformers, Graphical Neural Networks
1.4.1 - Direct CNN Models¶
We apply some simple NN models that are specifically designed to deal with masked inputs. Since we're dealing with spatiotemporal data, we will directly apply convolutions. We can increase the difficulty by applying Convolutional LSTMs, a popular architecture for spatiotemporal data. To deal with the missing data, we'll start with some simple ad-hoc masking techniques, similar to the kernel methods, and then move to more advanced methods like partial convolutions, which are compatible with neural networks.
- a - Convolutions w/ Masks - astropy | serket
- b - Partial Convolutions - keras - partial conv | NVidia
- c - Partial Convolution + TimeStepper - LSTMs - PConvLSTM
- Appendix - Masked Losses, Interpolation Losses, Convolution Family, RNN/GRU/LSTMs
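A minimal numpy sketch of the partial-convolution rule from (b): convolve only over valid pixels, renormalize by the observed fraction under the kernel, and mark an output as valid wherever any observed pixel contributed. Single-channel and loop-based for clarity; this is an illustrative re-implementation, not the keras/NVidia code:

```python
import numpy as np

def partial_conv2d(x, mask, kernel):
    """Partial convolution: out = (W * (X . M)) * (|W| / sum(M_patch))."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x * mask, ((ph, ph), (pw, pw)))  # zero out missing pixels
    mp = np.pad(mask, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    new_mask = np.zeros_like(mask, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            patch = xp[i:i + kh, j:j + kw]
            mpatch = mp[i:i + kh, j:j + kw]
            valid = mpatch.sum()
            if valid > 0:
                # renormalize by kernel size over the number of valid pixels
                out[i, j] = (kernel * patch).sum() * (kh * kw) / valid
                new_mask[i, j] = 1.0  # mask shrinks the hole each layer
    return out, new_mask

field = np.ones((8, 8))
mask = np.ones((8, 8)); mask[3:5, 3:5] = 0  # a 2x2 gap of missing pixels
kernel = np.full((3, 3), 1 / 9)             # mean filter
filled, new_mask = partial_conv2d(field, mask, kernel)
```

Stacking such layers progressively erodes the mask, which is what makes the architecture compatible with arbitrary gap patterns.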
1.4.2 - Direct Transformer Models¶
Here, we will use more advanced models called transformers. We look at the same task of dealing with missing values; however, now we can use patch embeddings to deal with the missing data.
- a - Masked AutoEncoder - keras | keras | SST | SatMAE
- b - SpatioTemporal Masked AutoEncoder - keras
- Appendix - Transformer, Attention, UNet, AE, PatchEmbedding Masks, Time Embeddings
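The patch-embedding trick for missing data can be sketched in numpy: split the field into non-overlapping tokens and keep only a random subset, as in a masked-autoencoder encoder. The names and toy field are illustrative, not the keras implementation:

```python
import numpy as np

def patchify(field, patch=4):
    """Split a 2D field into non-overlapping, flattened patches (tokens)."""
    H, W = field.shape
    p = field.reshape(H // patch, patch, W // patch, patch)
    return p.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def random_masking(tokens, mask_ratio=0.75, seed=0):
    """Keep a random subset of tokens; the rest are 'masked out' for the encoder."""
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    keep = int(n * (1 - mask_ratio))
    idx = rng.permutation(n)[:keep]
    return tokens[idx], idx

field = np.arange(64.0).reshape(8, 8)
tokens = patchify(field, patch=4)                 # 4 tokens of length 16
visible, idx = random_masking(tokens, mask_ratio=0.5)
```

For gappy observations, the same mechanism simply treats patches with missing pixels as the masked set, so the encoder never sees invalid values.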
1.4.3 - Graphical Models¶
We will look at Graphical Models as a different data structure for dealing with spatiotemporal data.
- Appendix - GNN
1.4.4 - Deep Equilibrium Models¶
We will add an extra implicit, fixed-point layer.
- a - DEQ from Scratch - Implicit Layers Tutorial
- b - jaxopt
- c - Optimistix
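The fixed-point view behind DEQs fits in a few lines of numpy. The contraction map z = tanh(Wz + x + b) and the naive iteration (rather than the Anderson/Newton solvers in jaxopt or Optimistix) are illustrative assumptions:

```python
import numpy as np

def deq_fixed_point(x, W, b, tol=1e-8, max_iter=500):
    """Solve z = tanh(W z + x + b) by simple fixed-point iteration."""
    z = np.zeros_like(x)
    for _ in range(max_iter):
        z_new = np.tanh(W @ z + x + b)
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z_new

rng = np.random.default_rng(0)
n = 8
W = 0.2 * rng.normal(size=(n, n)) / np.sqrt(n)  # small norm -> contraction
b = np.zeros(n)
x = rng.normal(size=n)
z = deq_fixed_point(x, W, b)
```

The "layer" output is the equilibrium z itself; in a real DEQ the backward pass differentiates through this equilibrium implicitly instead of unrolling the loop.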
1.4.5 - Conditional Flow Models¶
Here, we will use conditional flow models. These are conditional stochastic models. They include bijective, surjective, or stochastic architectures. The nice thing here is that we can reuse some of the previous architectures, e.g., the convolutions, the partial convolutions, and/or the transformers.
- Variational AutoEncoder + Masks - pyro-ppl
- PriorCVAE
- Stochastic Interpolants - Video | Video | Conditional Flow Matching | Stochastic Interpolants
1.5 - Parametric Dynamical Model (Field-Based)¶
In this application, we train a dynamical model that best fits the observations. The model complexity ranges from linear to nonlinear. The physics can range from a PDE to a surrogate model.
Use Cases:¶
- Learning - Scientific Discovery, Surrogate Model
- Estimation - Latent Variable Data Assimilation
Formulation¶
Algorithms¶
- Baseline: Kalman Filter Family
- Improved: PDE, Neural ODE, UDE
- Research: Deep Markov Model
1.5.1 - Learning Spatial Operators¶
We look at this from a spatiotemporal-decomposition perspective. We go over the basics of a state space model, including the dynamical (transition) model and the observation (emission) model. We then talk about the complexity of the system. In the case of observations only, we keep it simple with a masked observation operator. We will use a simple TimeStepper for all models, e.g., a "continuous" time stepper like a traditional ODESolver or a "discrete" time stepper like Euler.
- Universal Differential Equations (UDE) - Framework
- a - Linear Spatial Operator
- b - Convolutional (Finite Difference) Spatial Operator
- c - Spectral Convolutional Spatial Operator
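As a minimal illustration of (b), a finite-difference Laplacian plays the role of the convolutional spatial operator and explicit Euler plays the discrete TimeStepper, here for 1D periodic diffusion with illustrative parameters:

```python
import numpy as np

def laplacian_1d(u, dx):
    """Finite-difference Laplacian: a 3-point convolutional spatial operator."""
    return (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2

def euler_step(u, dt, dx, nu=0.1):
    """Discrete Euler TimeStepper for du/dt = nu * d2u/dx2."""
    return u + dt * nu * laplacian_1d(u, dx)

x = np.linspace(0, 2 * np.pi, 64, endpoint=False)
u = np.sin(x)
dx = x[1] - x[0]
for _ in range(100):
    u = euler_step(u, dt=1e-3, dx=dx)  # sine mode decays as exp(-nu * t)
```

In the UDE setting, the stencil coefficients (or the whole spatial operator) become learnable parameters while the time stepper stays fixed.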
Appendix
1.5.2 - Probabilistic Dynamical Models¶
In this section, we will look at how we can perform inference with time series. This will be useful for Reanalysis and Forecasting. A great introduction can be found here
1.5.2a - Conjugate Inference¶
Using conjugate priors and linear models gives us exact inference in closed form.
- a - Linear Model + Exact Inference
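For the linear-Gaussian case, exact inference is the Kalman filter. A minimal numpy predict/update sketch on an illustrative constant-velocity model (the transition model A is the dynamical model, H the observation model):

```python
import numpy as np

def kalman_step(m, P, y, A, Q, H, R):
    """One predict/update cycle of the linear-Gaussian Kalman filter."""
    # Predict with the dynamical (transition) model
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Update with the observation (emission) model
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = (np.eye(len(m)) - K @ H) @ P_pred
    return m_new, P_new

A = np.array([[1.0, 1.0], [0.0, 1.0]])  # constant-velocity dynamics
H = np.array([[1.0, 0.0]])              # observe position only
Q = 1e-4 * np.eye(2)
R = np.array([[0.1]])
m, P = np.zeros(2), np.eye(2)
for t in range(50):
    m, P = kalman_step(m, P, np.array([float(t)]), A, Q, H, R)
```

Because the data here follow the model exactly, the filter recovers the latent velocity even though only position is observed.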
1.5.2b - Parametric Inference¶
a.k.a. Deterministic Approximate Inference. This is a local approximation whereby we cover one mode of the potentially complex, multi-modal distribution really well: we approximate the posterior with a simpler distribution. These include staples like MLE, MAP, the Laplace approximation, VI, and EP.
- Non-Linear Model + Deterministic Approximate Inference
- Standard Approaches - EKF, UKF, ADF - Dynamax | Neural EKF | Training
- Approximate Expectation Propagation -
- Variational Approximate Inference - Slides
- Unified - Bayes-Newton
1.5.2c - Stochastic Inference¶
a.k.a. Stochastic Approximate Inference. We draw samples from the posterior. This includes staples like MCMC, HMC/NUTS, SGLD, Gibbs, and ESS.
Non-Linear Model + Stochastic Approximate Inference
- Ensemble Kalman Filter -
- Particle Filter - pfilter | pc - tutorial
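A minimal numpy sketch of the stochastic Ensemble Kalman Filter analysis step (perturbed observations); the fully-observed toy setup and names are illustrative:

```python
import numpy as np

def enkf_analysis(ensemble, y, H, R, rng):
    """Stochastic EnKF analysis: update each member with perturbed observations."""
    n_ens = ensemble.shape[1]
    X = ensemble - ensemble.mean(axis=1, keepdims=True)
    Pf = X @ X.T / (n_ens - 1)  # sample forecast covariance
    K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)
    # Perturb the observation for each member to get correct posterior spread
    y_pert = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R, n_ens).T
    return ensemble + K @ (y_pert - H @ ensemble)

rng = np.random.default_rng(0)
truth = np.array([1.0, -1.0])
H = np.eye(2)
R = 0.01 * np.eye(2)
ens = rng.normal(0.0, 1.0, size=(2, 200))  # prior ensemble (state dim x members)
ens = enkf_analysis(ens, truth, H, R, rng)
```

The sample covariance replaces the exact Kalman covariance, which is what makes the method scale to high-dimensional geoscience states.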
Appendix
- Sequential Model Inference - Exact, (V)EM, (V)EP,
- Packages - Nested Sampling | SGMCMC | BlackJax
1.5.3 - Latent Probabilistic Dynamical Models¶
We look at state space models in general starting with linear models.
- a - Conjugate Transform (Conditional Markov Flows)
- b - Stochastic Transform Filter
- Stochastic Inference - ROAD-EnsKF
- Variational Inference - pyro - DMM | numpyro - DMM | DMM | PgDMM
- observation operator encoder - KVAE
- c -
- d - Neural SDE
2.0 - Observations to Reanalysis¶
Data:
- L2 Gappy Observations
- L3 Gap-Filled Observations
- L4 Reanalysis
Cases:
- Data - L2 Obs
- Data - L2 Obs and L3 Interpolated Obs
- Data - L2 Obs, L3 Interpolated Obs, L4 Reanalysis
Formulation¶
Ideas:
- Sequential DA, Variational DA, Amortized DA
- Dynamical Model - Physical, Hybrid, Surrogate
- Bi-Level Optimization
- Dynamical Inference - MLE, MAP, Variational, Laplace, EM, VEM
- Amortized Model - Direct, DEQ
- Bilevel Optimization
- Plug n Play Prior
Physics¶
2.1 - Parametric Dynamical Model (Field-Based)¶
Use Cases¶
- Estimation - Reanalysis
- Learning - Physical Models
Formulation¶
Algorithms¶
- Baseline - Parametric Dynamical Model + 3D/4DVar + BiLevel Optimization
- Improved - Hybrid Dynamical Model - 3D/4DVar + VI
- Research - LatentVar
2.2 - Amortized Parametric Model¶
Use Cases¶
- Learning - Surrogate Modeling, Surrogate Reanalysis
Formulation¶
Algorithms¶
- Baseline - Deep Equilibrium Model
3.0 - Reanalysis to X-Casting¶
Data:
- L2 Gappy Observations
- L3 Gap-Filled Observations
- L4 Reanalysis
Cases:
- Data - L2 Obs
- Data - L2 Obs and L3 Interpolated Obs
- Data - L2 Obs, L3 Interpolated Obs, L4 Reanalysis
Use Cases:
- NowCasting
- ForeCasting
- Projections
Formulation¶
Parametric Surrogate Model¶
Algorithms
- Baseline: Spectral Conv, UNet,
- Improvements: GNN, Transformer
Ideas:
- Bilevel Optimization