
Optimal Interpolation

CNRS
MEOM

Optimal interpolation aims at finding the Best Linear Unbiased Estimator (BLUE).

Problem Setting

$$\mathbf{x} = \left[\text{lon}_1, \ldots, \text{lon}_{D_\text{lon}}, \text{lat}_1, \ldots, \text{lat}_{D_\text{lat}}, \text{time}_1, \ldots, \text{time}_{D_\text{time}} \right] \in \mathbb{R}^{D_\text{lat} \times D_\text{lon} \times D_\text{time}}$$

For example, if we have a spatial lat-lon grid of 30×30 points and 30 time steps, then the state has 30×30×30 = 27,000 elements, i.e. a 27,000-dimensional vector!
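A minimal sketch of this bookkeeping (NumPy assumed, with a made-up SSH array):

```python
import numpy as np

# Hypothetical SSH field on a 30 x 30 spatial grid over 30 time steps.
n_lat, n_lon, n_time = 30, 30, 30
ssh = np.random.randn(n_lat, n_lon, n_time)

# Flattening the space-time grid gives the state vector x used below.
x = ssh.ravel()
print(x.shape)  # (27000,) -> a 27,000-dimensional state vector
```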


Optimal Interpolation (OI) is widely regarded as the gold standard for gap-filling SSH fields with missing data, a situation that is very common for satellite sensors. However, the standard OI method relies on the linear update below:

$$
\begin{aligned}
\tilde{\boldsymbol{\mu}}_{\mathbf{x}_{t}} &= \mathbf{x}_t + \mathbf{K}(\mathbf{y}_t - \mathbf{H}\mathbf{x}_t) \\
\tilde{\boldsymbol{\Sigma}}_{\mathbf{x}_{t}} &= (\mathbf{I} - \mathbf{K}\mathbf{H})\boldsymbol{\Sigma}_\mathbf{x} \\
\mathbf{K} &= \mathbf{P} \mathbf{H}^\top\left(\mathbf{H} \mathbf{P} \mathbf{H}^\top + \mathbf{R} \right)^{-1}
\end{aligned}
$$

The matrix $\mathbf{K}$ is known as the Kalman gain. The optimal $\mathbf{K}$ is given by:

$$\mathbf{K} = \boldsymbol{\Sigma}_{\mathbf{xy}} \boldsymbol{\Sigma}_{\mathbf{yy}}^{-1}$$

where $\boldsymbol{\Sigma}_{\mathbf{yy}}$ is the covariance of the observations and $\boldsymbol{\Sigma}_{\mathbf{xy}}$ is the cross-covariance between the state and the observations.

Notice how this assumes that the state and the observations are related by a linear operator, $\mathbf{H}$, and that the best interpolation is the linear one given by the Kalman gain, $\mathbf{K}$. However, this is often insufficient when the data do not exhibit linear dynamics.

Data-Driven OI

So in practice, it is often not possible to write down an adequate linear model and its covariances by hand; instead, we estimate the quantities involved directly from the data.
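As a minimal sketch of what such an estimate could look like (all data below is synthetic and purely illustrative), the gain $\mathbf{K} = \boldsymbol{\Sigma}_{\mathbf{xy}}\boldsymbol{\Sigma}_{\mathbf{yy}}^{-1}$ can be formed from empirical covariances of paired state/observation samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d_state, d_obs = 500, 10, 4

# Hypothetical paired samples of states and noisy, linearly observed data.
X = rng.normal(size=(n_samples, d_state))
H = rng.normal(size=(d_obs, d_state))
Y = X @ H.T + 0.1 * rng.normal(size=(n_samples, d_obs))

# Empirical covariances.
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Sigma_xy = Xc.T @ Yc / (n_samples - 1)   # cross-covariance, (d_state, d_obs)
Sigma_yy = Yc.T @ Yc / (n_samples - 1)   # observation covariance, (d_obs, d_obs)

# Empirical Kalman gain K = Sigma_xy @ Sigma_yy^{-1}.
K = Sigma_xy @ np.linalg.inv(Sigma_yy)
```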

Notation

| Symbol | Meaning |
| --- | --- |
| $x \in \mathbb{R}$ | scalar |
| $\mathbf{x} \in \mathbb{R}^{D}$ | vector |
| $\mathbf{X} \in \mathbb{R}^{N \times D}$ | matrix |
| $\mathbf{K}_{\mathbf{XX}} := \mathbf{K} \in \mathbb{R}^{N \times N}$ | the kernel matrix |
| $\boldsymbol{k}(\mathbf{X}, \mathbf{x}) = \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}) \in \mathbb{R}^{N}$ | the cross-kernel vector for some data, $\mathbf{X} \in \mathbb{R}^{N \times D}$, and some new vector, $\mathbf{x}$ |

Setting

$$
\begin{aligned}
\boldsymbol{x}(t) &= \mathcal{M}\left( \boldsymbol{x}(t-1) \right) + \boldsymbol{\eta}(t) \\
\boldsymbol{y}(t) &= \mathcal{H}(\boldsymbol{x}(t)) + \boldsymbol{\epsilon}(t)
\end{aligned}
$$

Discrete Setting

$$
\begin{aligned}
\mathbf{x}_{k} &= \mathbf{x}_{k-1} + \boldsymbol{\epsilon}_{\mathbf{x}} \\
\mathbf{y}_{k} &= \mathbf{H}\mathbf{x}_{k} + \boldsymbol{\epsilon}_{\mathbf{y}}
\end{aligned}
$$

where:

  • $\boldsymbol{\epsilon}_{\mathbf{x}} \sim \mathcal{N}(\mathbf{0}, \mathbf{P})$
  • $\boldsymbol{\epsilon}_{\mathbf{y}} \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$

O.I. Equations

$$
\begin{aligned}
\mathbf{K} &= \mathbf{P} \mathbf{H}^\top\left(\mathbf{H} \mathbf{P} \mathbf{H}^\top + \mathbf{R} \right)^{-1} \\
\tilde{\boldsymbol{\mu}}_{\mathbf{x}_{t}} &= \mathbf{x}_t + \mathbf{K}(\mathbf{y}_t - \mathbf{H}\mathbf{x}_t) \\
\tilde{\boldsymbol{\Sigma}}_{\mathbf{x}_{t}} &= (\mathbf{I} - \mathbf{K}\mathbf{H})\boldsymbol{\Sigma}_\mathbf{x}
\end{aligned}
$$

where:

  • $\mathbf{y}$ - observations
  • $\mathbf{x}$ - prior input
  • $\mathbf{K}$ - Kalman gain
  • $\mathbf{P}$ (often written $\mathbf{B}$) - noise covariance matrix of the prior (background) state
  • $\mathbf{R}$ - noise covariance matrix of the observations.
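For concreteness, here is a minimal NumPy sketch of the O.I. update above; the dimensions, $\mathbf{P}$, $\mathbf{R}$ and $\mathbf{H}$ are all arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
d_state, d_obs = 20, 8

x_b = rng.normal(size=d_state)                 # prior (background) state
H = rng.normal(size=(d_obs, d_state))          # linear observation operator
P = np.eye(d_state)                            # prior error covariance
R = 0.1 * np.eye(d_obs)                        # observation error covariance
y = H @ x_b + rng.multivariate_normal(np.zeros(d_obs), R)

# Kalman gain: K = P H^T (H P H^T + R)^{-1}
K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)

x_a = x_b + K @ (y - H @ x_b)                  # updated mean
P_a = (np.eye(d_state) - K @ H) @ P            # updated covariance
```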

Cost Function

Lorenc (1986) showed that OI is closely related to the 3D-Var variational data assimilation cost function:

$$\mathcal{L}(\boldsymbol{\theta}) = (\mathbf{x} - \mathbf{x}^b)^\top \mathbf{B}^{-1} (\mathbf{x} - \mathbf{x}^b) + (\mathbf{y} - \mathbf{Hx})^\top \mathbf{R}^{-1} (\mathbf{y} - \mathbf{Hx})$$

where $\boldsymbol{\theta} = \{ \mathbf{B}, \mathbf{R}\}$.
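A small numerical sketch of this connection (with arbitrary $\mathbf{B}$, $\mathbf{R}$, $\mathbf{H}$ and synthetic data): the OI analysis should never have a higher 3D-Var cost than the background.

```python
import numpy as np

rng = np.random.default_rng(2)
d_state, d_obs = 6, 3

x_b = rng.normal(size=d_state)          # background state
H = rng.normal(size=(d_obs, d_state))   # observation operator
B = np.eye(d_state)                     # background error covariance
R = 0.1 * np.eye(d_obs)                 # observation error covariance
y = rng.normal(size=d_obs)              # observations

def cost(x):
    # 3D-Var cost: background misfit + observation misfit.
    dxb, dy = x - x_b, y - H @ x
    return dxb @ np.linalg.solve(B, dxb) + dy @ np.linalg.solve(R, dy)

# OI analysis with K = B H^T (H B H^T + R)^{-1}
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
x_a = x_b + K @ (y - H @ x_b)

print(cost(x_a) <= cost(x_b))  # True: the analysis lowers the 3D-Var cost
```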


In Practice

It does not make sense to assume an affine linear transformation if our model cannot actually be described by one. Instead, we can estimate it directly from the data. Let's take some observed coordinate-value pairs, $\{\mathbf{X}_{obs}, \mathbf{Y}_{obs} \} = \{\mathbf{x}_{obs}, \mathbf{y}_{obs} \}_{i=1}^N$. We will learn a model that best fits these observations and then use it to interpolate and fill in the gaps.

Historical Literature:

  • Bretherton et al. 1976
  • McIntosh, 1990
  • Le Traon et al. 1998

Modern Literature:

  • Kernel Methods
  • Kernel Interpolation

We can write an exact equation for the best least-squares linear estimator, $\boldsymbol{f}(\mathbf{x})$:

$$\boldsymbol{f}(\mathbf{x}^*)= \boldsymbol{\mu}_\mathbf{x} + \sum_{i=1}^{D_\mathbf{x}} \sum_{j=1}^{D_\mathbf{y}} \mathbf{A}_{ij}^{-1}\mathbf{C}_{\mathbf{x}j}(\mathbf{y}_i - \boldsymbol{\mu}_{\mathbf{x}j})$$

where $\mathbf{A}_{ij}$ is the covariance between the observation locations:

$$\mathbf{A}_{ij}= \langle \mathbf{x}_{obs}, \mathbf{x}'_{obs} \rangle + \langle \boldsymbol{\epsilon}, \boldsymbol{\epsilon}' \rangle$$

and $\mathbf{C}_{\mathbf{x}j}$ is the cross-covariance between the locations to be estimated and the observation locations:

$$\mathbf{C}_{\mathbf{x}j} = \langle \mathbf{x}^*, \mathbf{x}_{obs} \rangle$$

Reformulation

We can write the full equation (in vector form) as follows:

$$\boldsymbol{f}(\mathbf{x}^*) = \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}^*) (\mathbf{K_{XX}} + \sigma^2\mathbf{I})^{-1}\mathbf{Y}_{obs}$$

where:

  • $\mathbf{K_{XX}}$ - the kernel matrix over the coordinates where we have observations, $\mathbf{Y}_{obs} \in \mathbb{R}^{N \times D}$.
  • $\boldsymbol{k}_{\mathbf{X}}(\cdot)$ - the cross-kernel between the new coordinates, $\mathbf{x}^* \in \mathbb{R}^{D}$, and the observed coordinates, $\mathbf{X} \in \mathbb{R}^{N\times D}$.

We have a kernel function which gives the correlation between the observation locations, $\mathbf{x}_{obs}$:

$$\mathbf{K}_{ij}= \langle \mathbf{x}_{obs}, \mathbf{x}'_{obs} \rangle$$

where $\mathbf{X}_{obs} \in \mathbb{R}^{N \times D}$.

If we group the terms that depend on the observations, $\mathbf{Y}_{obs}$, we can call the result $\boldsymbol{\alpha}$:

$$\boldsymbol{\alpha} = (\mathbf{K_{XX}} + \sigma^2\mathbf{I})^{-1}\mathbf{Y}_{obs}$$

where $\boldsymbol{\alpha} \in \mathbb{R}^{N \times D}$. We can solve for $\boldsymbol{\alpha}$ exactly. To find the best free parameters, $\boldsymbol{\theta} = \{ \sigma^2, \boldsymbol{\alpha} \}$, we can use cross-validation.

Note: This comes from kernel methods!


And now we can make predictions

$$\boldsymbol{m}(\mathbf{x}^*) = \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}^*) \boldsymbol{\alpha}$$

and we can also estimate the covariance.

$$\boldsymbol{\Sigma}^2(\mathbf{x}^*, \mathbf{x}'^*) = \sigma^2 + \boldsymbol{k}(\mathbf{x}^*, \mathbf{x}'^*) - \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}^*) \left(\mathbf{K_{XX}} + \sigma^2\mathbf{I}\right)^{-1} \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}'^*)^\top$$
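Putting the pieces together, here is a minimal sketch of the exact kernel solution above; an RBF kernel and synthetic 1-D coordinates are assumed (in practice the coordinates would be lon/lat/time):

```python
import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    # k(x, y) = exp(-gamma * ||x - y||^2), evaluated pairwise over rows.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(3)
N = 50
X_obs = rng.uniform(0, 1, size=(N, 1))                  # observed coordinates
Y_obs = np.sin(2 * np.pi * X_obs) + 0.1 * rng.normal(size=(N, 1))
sigma2 = 0.1 ** 2                                       # noise variance

# alpha = (K_XX + sigma^2 I)^{-1} Y_obs
K_XX = rbf_kernel(X_obs, X_obs)
alpha = np.linalg.solve(K_XX + sigma2 * np.eye(N), Y_obs)

# Predictions at new coordinates (the gaps to fill).
X_new = np.linspace(0, 1, 200)[:, None]
k_X = rbf_kernel(X_new, X_obs)

mean = k_X @ alpha                                      # predictive mean m(x*)
cov = (sigma2 * np.eye(len(X_new)) + rbf_kernel(X_new, X_new)
       - k_X @ np.linalg.solve(K_XX + sigma2 * np.eye(N), k_X.T))
```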

Notation

  • $\boldsymbol{k}(\mathbf{x}, \mathbf{y}): \mathcal{X}\times \mathcal{Y} \rightarrow \mathbb{R}$ - kernel function that takes in two vectors and returns a scalar value.
  • $\boldsymbol{k}(\mathbf{x}_*): \mathcal{X} \rightarrow \mathbb{R}^{N}$ - cross-kernel vector between a new point, $\mathbf{x}_*$, and the $N$ observed points.
  • $\mathbf{K_{XX}} \in \mathbb{R}^{N_{\mathbf{x}} \times N_\mathbf{x}}$ - kernel matrix (Gram matrix) obtained by evaluating the kernel function on all pairs of points in $\mathbf{X}$.
  • $\mathbf{K_{XY}} \in \mathbb{R}^{N_{\mathbf{x}} \times N_\mathbf{y}}$ - cross-kernel matrix between the points in $\mathbf{X}$ and the points in $\mathbf{Y}$.

Kernel Functions

Linear Kernel

$$\boldsymbol{k}(\mathbf{x},\mathbf{y}) = \mathbf{x}^\top \mathbf{y}$$

Gaussian Kernel (aka Radial Basis Function; the $\nu \to \infty$ limit of the Matérn family)

$$\boldsymbol{k}(\mathbf{x},\mathbf{y}) = \exp \left( - \gamma\,||\mathbf{x} - \mathbf{y}||_2^2 \right)$$

Note: This is known as the Gaussian kernel if $\gamma = \frac{1}{\sigma^2}$.

Laplacian Kernel

$$\boldsymbol{k}(\mathbf{x},\mathbf{y}) = \exp \left( - \gamma\,||\mathbf{x} - \mathbf{y}||_1 \right)$$

Polynomial Kernel

$$\boldsymbol{k}(\mathbf{x},\mathbf{y}) = \left( \gamma\,\mathbf{x}^\top \mathbf{y} + c_0 \right)^d$$
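Minimal NumPy sketches of these kernel functions, evaluated pairwise over the rows of two coordinate arrays (parameter defaults are arbitrary):

```python
import numpy as np

def linear_kernel(X, Y):
    # k(x, y) = x^T y
    return X @ Y.T

def rbf_kernel(X, Y, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||_2^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def laplacian_kernel(X, Y, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||_1)
    d1 = np.abs(X[:, None, :] - Y[None, :, :]).sum(-1)
    return np.exp(-gamma * d1)

def polynomial_kernel(X, Y, gamma=1.0, c0=1.0, degree=3):
    # k(x, y) = (gamma * x^T y + c0)^d
    return (gamma * X @ Y.T + c0) ** degree

X = np.random.randn(5, 2)
Y = np.random.randn(4, 2)
print(rbf_kernel(X, Y).shape)  # (5, 4)
```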

Scaling

The bottleneck in the above methods is the inversion of the $\mathbf{K_{XX}}$ matrix, which is of order $\mathcal{O}(N^3)$.

We can assume that our kernel matrix $\mathbf{K_{XX}}$ can be approximated via a low-rank matrix:

$$\mathbf{K_{XX}} \approx \mathbf{K_{XZ}}\mathbf{K_{XZ}}^\top$$

where $\mathbf{K_{XZ}} \in \mathbb{R}^{N \times r}$.

Similarly, we can also write:

$$\mathbf{K_{XX}} \approx \mathbf{K_{XZ}}\mathbf{K_{ZZ}}^{-1}\mathbf{K_{XZ}}^\top$$

where $\mathbf{Z} \in \mathbb{R}^{r \times D}$ is a set of $r$ inducing points, with the corresponding reduced cross-kernel

$$\boldsymbol{k}_r(\mathbf{x}) := \boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}) \in \mathbb{R}^{r}$$

So then we can rewrite our solution to be:

$$\boldsymbol{\alpha} = \left( \mathbf{K_{XZ}}^\top \mathbf{K_{XZ}} + \sigma^2 \mathbf{K_{ZZ}} \right)^{\dagger} \mathbf{K_{XZ}}^\top \mathbf{Y}_{obs}$$

Now to make predictions, we can simply write:

$$\boldsymbol{m}(\mathbf{x}^*) = \boldsymbol{k}_{r}(\mathbf{x}^*)\boldsymbol{\alpha}$$

This is much cheaper:

$$\mathcal{O}(Nr^2 + r^3) < \mathcal{O}(N^3)$$

Predictive Mean

$$\boldsymbol{m}(\mathbf{x}^*) = \boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}^*)\boldsymbol{\alpha}$$

Predictive Covariance

$$\boldsymbol{\Sigma}^2(\mathbf{x}^*, \mathbf{x}'^*) = \boldsymbol{k}(\mathbf{x}^*, \mathbf{x}'^*) - \boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}^*)^\top \mathbf{K_{ZZ}}^{-1}\boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}'^*) + \boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}^*)^\top \left( \mathbf{K_{ZZ}} + \sigma^{-2} \mathbf{K_{ZX}}\mathbf{K_{XZ}} \right)^{-1} \boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}'^*)$$

where:

  • $\mathbf{K_{XX}} \in \mathbb{R}^{N \times N}$
  • $\mathbf{K_{ZZ}} \in \mathbb{R}^{r \times r}$
  • $\mathbf{K_{ZX}} \in \mathbb{R}^{r \times N}$
  • $\mathbf{K_{XZ}} \in \mathbb{R}^{N \times r}$
  • $\boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}) \in \mathbb{R}^{r}$
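A minimal sketch of this sparse approach, with the inducing points taken as a random subset of the observed coordinates (all data and parameters below are placeholders):

```python
import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    # Pairwise RBF kernel between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(4)
N, r = 2000, 50
X_obs = rng.uniform(0, 1, size=(N, 1))
Y_obs = np.sin(2 * np.pi * X_obs) + 0.1 * rng.normal(size=(N, 1))
sigma2 = 0.1 ** 2

# Inducing points Z: here simply a random subset of the observed coordinates.
Z = X_obs[rng.choice(N, size=r, replace=False)]

K_XZ = rbf_kernel(X_obs, Z)   # (N, r)
K_ZZ = rbf_kernel(Z, Z)       # (r, r)

# alpha = (K_XZ^T K_XZ + sigma^2 K_ZZ)^+ K_XZ^T Y_obs  -- an (r x r) system, not (N x N).
A = K_XZ.T @ K_XZ + sigma2 * K_ZZ
alpha = np.linalg.lstsq(A, K_XZ.T @ Y_obs, rcond=None)[0]

# Predictive mean: m(x*) = k_Z(x*) alpha
X_new = np.linspace(0, 1, 200)[:, None]
mean = rbf_kernel(X_new, Z) @ alpha
```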

Resources:

  • Less is More: Nyström Computational Regularization
  • Using Nystrom to Speed Up Kernel Machines

  • Linear -> Nystrom Approximation
  • Polynomial -> Nystrom / TensorSketch
  • Gaussian -> Nystrom / RFF

Potential Projects

  • Kernel Methods - solve exactly with cross validation for the best parameters
  • Gaussian Processes - solve with maximum likelihood
  • Kernel Functions - explore different kernel functions and see how they work (e.g. linear, Gaussian, Laplacian, Haversine, Matern Family)
  • Sparse Methods (Kernel) - use kernel approximation schemes (e.g. Nystrom, RFF)
  • Sparse GP Methods - use inducing points to see how they work.

Model

We use a non-parametric data model: the fit is built from many local contributions, and the effective number of parameters can grow with, and even exceed, the number of data points.

$$\mathbf{y}_i = \boldsymbol{f}(\mathbf{x}_i) + \boldsymbol{\epsilon}_i, \hspace{10mm} i=1,2, \ldots, N$$

where:

  • $\boldsymbol{f}(\cdot)$ - the regression function
  • $\mathbf{x}_i$ - the sampling position, e.g. latitude, longitude, and/or time
  • $\mathbf{y}_i$ - the sample
  • $\boldsymbol{\epsilon}_i$ - zero-mean, i.i.d. noise

Kernel Methods

  • Connections and Equivalences between the Nyström Method and Sparse Variational Gaussian Processes - Arxiv
  • Gaussian Processes for Machine Learning - PDF
  • PyKrige
  • Spatial Analysis Made Easy with Linear Regression and Kernels - PDF
  • RFFs - blog
  • GP Regression - blog | blog
  • An exact kernel framework for spatio-temporal dynamics - arxiv
  • 4DVarNet - Broadcast | Loops | Batch
  • Spatio-Temporal VGPs (Spatio-Temporal Variational Gaussian Processes) - arxiv
  • Space-Time Kernels - prezi
  • Falkon - python
  • Kernel methods & sparse methods for computer vision - bach - prezi
  • Kernel methods through the roof: handling billions of points efficiently - arxiv
  • KRR Demo - sklearn

Theory

  • Machine Learning with Kernel Methods - Slides
  • Kernel Ridge Regression Course - Class

References

  • Gaussian Process Machine Learning and Kriging for Groundwater Salinity Interpolation - Cui et al (2021) - Paper

They use different kernels with different model outputs to try and interpolate. They found that more expressive kernels with combinations of inputs provided better models.

Learning

  • How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets - Lu et al (2015)
  • On the Benefits of Large Learning Rates for Kernel Methods - Beugnot et al (2022)
  • Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems - Nakkiran et al (2020)

Motivation

  • Demo for Scalable Kernel Features w/ Sklearn - Blog

Software Planning


RFF


Scalable Computation

  • Kernel methods through the roof: handling billions of points efficiently - Meanti et al (2020) - arxiv | video
  • Efficient Hyperparameter Tuning for Large Scale Kernel Ridge Regression - Meanti et al (2022) - arxiv
  • Scalable Kernel Methods via Doubly Stochastic Gradients - Dai et al (2015)
  • Making Large-Scale Nyström Approximation Possible - Li et al (2010) | Blog
  • Large-Scale Nyström Kernel Matrix Approximation Using Randomized SVD - Li et al (2014)
  • Practical Algorithms for Latent Variable Models - Gundersen (Thesis) (2021) - thesis


Spatio-Temporal Kernels

  • Kernel Regression for Image Processing and Reconstruction - Takeda et al (2007) - pdf
  • Composite Kernels for HSI classification - Gustau (2006) - Paper

Connections

  • Deep Gaussian Markov Random Fields - Siden & Lindsten (2020) - arxiv

Markovian Gaussian Processes

  • Spatio-Temporal Variational Gaussian Processes - Hamelijnick et al (2021) - arxiv (JAX)
  • Variational Gaussian Process State-Space Models - Frigola et al (2014) - arxiv

Periodic Activation Functions

  • NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis - Mildenhall et al (2020)
  • Implicit Neural Representations with Periodic Activation Functions - Sitzmann et al (2020) - arxiv | page | eccv vid | blog

Kernel Functions