
Optimal Interpolation

CNRS
MEOM

Optimal interpolation aims at finding the Best Linear Unbiased Estimator (BLUE).

Problem Setting

$$\mathbf{x} = \left[\text{lon}_1, \ldots, \text{lon}_{D_\text{lon}}, \text{lat}_1, \ldots, \text{lat}_{D_\text{lat}}, \text{time}_1, \ldots, \text{time}_{D_\text{time}} \right] \in \mathbb{R}^{D_\text{lat} \times D_\text{lon} \times D_\text{time}}$$

For example, if we have a spatial lat-lon grid of 30×30 points and 30 time steps, then the state has 30×30×30 = 27,000 elements, i.e. a 27,000-dimensional vector!
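A minimal sketch of this bookkeeping (NumPy assumed, with a made-up SSH array):

```python
import numpy as np

# Hypothetical SSH field on a 30 x 30 spatial grid over 30 time steps.
n_lat, n_lon, n_time = 30, 30, 30
ssh = np.random.randn(n_lat, n_lon, n_time)

# Flattening the space-time grid gives the state vector x used below.
x = ssh.ravel()
print(x.shape)  # (27000,) -> a 27,000-dimensional state vector
```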


Optimal Interpolation (OI) is widely regarded as the gold standard for gap-filling SSH fields with missing data, a situation that is very common for satellite sensors. However, the standard OI method relies on the linear update below:

$$
\begin{aligned}
\tilde{\boldsymbol{\mu}}_{\mathbf{x}_{t}} &= \mathbf{x}_t + \mathbf{K}(\mathbf{y}_t - \mathbf{H}\mathbf{x}_t) \\
\tilde{\boldsymbol{\Sigma}}_{\mathbf{x}_{t}} &= (\mathbf{I} - \mathbf{K}\mathbf{H})\boldsymbol{\Sigma}_\mathbf{x} \\
\mathbf{K} &= \mathbf{P} \mathbf{H}^\top\left(\mathbf{H} \mathbf{P} \mathbf{H}^\top + \mathbf{R} \right)^{-1}
\end{aligned}
$$

The matrix $\mathbf{K}$ is known as the Kalman gain. The optimal $\mathbf{K}$ is given by:

$$\mathbf{K} = \boldsymbol{\Sigma}_{\mathbf{xy}} \boldsymbol{\Sigma}_{\mathbf{yy}}^{-1}$$

where $\boldsymbol{\Sigma}_{\mathbf{yy}}$ is the covariance of the observations and $\boldsymbol{\Sigma}_{\mathbf{xy}}$ is the cross-covariance between the state and the observations.

Notice how this assumes that the state and the observations are related by a linear operator, $\mathbf{H}$, and that the best interpolation is the linear one given by the Kalman gain, $\mathbf{K}$. However, this is often insufficient when the data do not exhibit linear dynamics.

Data-Driven OI

So in practice, it is often not possible to write down an adequate linear model and its covariances by hand; instead, we estimate the quantities involved directly from the data.
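As a minimal sketch of what such an estimate could look like (all data below is synthetic and purely illustrative), the gain $\mathbf{K} = \boldsymbol{\Sigma}_{\mathbf{xy}}\boldsymbol{\Sigma}_{\mathbf{yy}}^{-1}$ can be formed from empirical covariances of paired state/observation samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d_state, d_obs = 500, 10, 4

# Hypothetical paired samples of states and noisy, linearly observed data.
X = rng.normal(size=(n_samples, d_state))
H = rng.normal(size=(d_obs, d_state))
Y = X @ H.T + 0.1 * rng.normal(size=(n_samples, d_obs))

# Empirical covariances.
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Sigma_xy = Xc.T @ Yc / (n_samples - 1)   # cross-covariance, (d_state, d_obs)
Sigma_yy = Yc.T @ Yc / (n_samples - 1)   # observation covariance, (d_obs, d_obs)

# Empirical Kalman gain K = Sigma_xy @ Sigma_yy^{-1}.
K = Sigma_xy @ np.linalg.inv(Sigma_yy)
```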

Notation

| Symbol | Meaning |
| --- | --- |
| $x \in \mathbb{R}$ | scalar |
| $\mathbf{x} \in \mathbb{R}^{D}$ | vector |
| $\mathbf{X} \in \mathbb{R}^{N \times D}$ | matrix |
| $\mathbf{K}_{\mathbf{XX}} := \mathbf{K} \in \mathbb{R}^{N \times N}$ | the kernel matrix |
| $\boldsymbol{k}(\mathbf{X}, \mathbf{x}) = \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}) \in \mathbb{R}^{N}$ | the cross-kernel vector for some data, $\mathbf{X} \in \mathbb{R}^{N \times D}$, and some new vector, $\mathbf{x}$ |

Setting

$$
\begin{aligned}
\boldsymbol{x}(t) &= \mathcal{M}\left( \boldsymbol{x}(t-1) \right) + \boldsymbol{\eta}(t) \\
\boldsymbol{y}(t) &= \mathcal{H}(\boldsymbol{x}(t)) + \boldsymbol{\epsilon}(t)
\end{aligned}
$$

Discrete Setting

$$
\begin{aligned}
\mathbf{x}_{k} &= \mathbf{x}_{k-1} + \boldsymbol{\epsilon}_{\mathbf{x}} \\
\mathbf{y}_{k} &= \mathbf{H}\mathbf{x}_{k} + \boldsymbol{\epsilon}_{\mathbf{y}}
\end{aligned}
$$

where:

  • $\boldsymbol{\epsilon}_{\mathbf{x}} \sim \mathcal{N}(\mathbf{0}, \mathbf{P})$
  • $\boldsymbol{\epsilon}_{\mathbf{y}} \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$

O.I. Equations

$$
\begin{aligned}
\mathbf{K} &= \mathbf{P} \mathbf{H}^\top\left(\mathbf{H} \mathbf{P} \mathbf{H}^\top + \mathbf{R} \right)^{-1} \\
\tilde{\boldsymbol{\mu}}_{\mathbf{x}_{t}} &= \mathbf{x}_t + \mathbf{K}(\mathbf{y}_t - \mathbf{H}\mathbf{x}_t) \\
\tilde{\boldsymbol{\Sigma}}_{\mathbf{x}_{t}} &= (\mathbf{I} - \mathbf{K}\mathbf{H})\boldsymbol{\Sigma}_\mathbf{x}
\end{aligned}
$$

where:

  • $\mathbf{y}$ - observations
  • $\mathbf{x}$ - prior input
  • $\mathbf{K}$ - Kalman gain
  • $\mathbf{P}$ (often written $\mathbf{B}$) - noise covariance matrix of the prior (background) state
  • $\mathbf{R}$ - noise covariance matrix of the observations.
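For concreteness, here is a minimal NumPy sketch of the O.I. update above; the dimensions, $\mathbf{P}$, $\mathbf{R}$ and $\mathbf{H}$ are all arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
d_state, d_obs = 20, 8

x_b = rng.normal(size=d_state)                 # prior (background) state
H = rng.normal(size=(d_obs, d_state))          # linear observation operator
P = np.eye(d_state)                            # prior error covariance
R = 0.1 * np.eye(d_obs)                        # observation error covariance
y = H @ x_b + rng.multivariate_normal(np.zeros(d_obs), R)

# Kalman gain: K = P H^T (H P H^T + R)^{-1}
K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)

x_a = x_b + K @ (y - H @ x_b)                  # updated mean
P_a = (np.eye(d_state) - K @ H) @ P            # updated covariance
```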

Cost Function

Lorenc (1986) showed that OI is closely related to the 3D-Var variational data assimilation cost function:

$$\mathcal{L}(\boldsymbol{\theta}) = (\mathbf{x} - \mathbf{x}^b)^\top \mathbf{B}^{-1} (\mathbf{x} - \mathbf{x}^b) + (\mathbf{y} - \mathbf{Hx})^\top \mathbf{R}^{-1} (\mathbf{y} - \mathbf{Hx})$$

where $\boldsymbol{\theta} = \{ \mathbf{B}, \mathbf{R}\}$.
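A small numerical sketch of this connection (with arbitrary $\mathbf{B}$, $\mathbf{R}$, $\mathbf{H}$ and synthetic data): the OI analysis should never have a higher 3D-Var cost than the background.

```python
import numpy as np

rng = np.random.default_rng(2)
d_state, d_obs = 6, 3

x_b = rng.normal(size=d_state)          # background state
H = rng.normal(size=(d_obs, d_state))   # observation operator
B = np.eye(d_state)                     # background error covariance
R = 0.1 * np.eye(d_obs)                 # observation error covariance
y = rng.normal(size=d_obs)              # observations

def cost(x):
    # 3D-Var cost: background misfit + observation misfit.
    dxb, dy = x - x_b, y - H @ x
    return dxb @ np.linalg.solve(B, dxb) + dy @ np.linalg.solve(R, dy)

# OI analysis with K = B H^T (H B H^T + R)^{-1}
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
x_a = x_b + K @ (y - H @ x_b)

print(cost(x_a) <= cost(x_b))  # True: the analysis lowers the 3D-Var cost
```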


In Practice

It does not make sense to assume an affine linear transformation if our model cannot actually be described by one. Instead, we can estimate it directly from the data. Let's take some observed coordinate-value pairs, $\{\mathbf{X}_{obs}, \mathbf{Y}_{obs} \} = \{\mathbf{x}_{obs}, \mathbf{y}_{obs} \}_{i=1}^N$. We will learn a model that best fits these observations and then use it to interpolate and fill in the gaps.

Historical Literature:

  • Bretherton et al. 1976
  • McIntosh, 1990
  • Le Traon et al. 1998

Modern Literature:

  • Kernel Methods
  • Kernel Interpolation

We can write an exact equation for the best least-squares linear estimator, $\boldsymbol{f}(\mathbf{x})$:

$$\boldsymbol{f}(\mathbf{x}^*)= \boldsymbol{\mu}_\mathbf{x} + \sum_{i=1}^{D_\mathbf{x}} \sum_{j=1}^{D_\mathbf{y}} \mathbf{A}_{ij}^{-1}\mathbf{C}_{\mathbf{x}j}(\mathbf{y}_i - \boldsymbol{\mu}_{\mathbf{x}j})$$

where $\mathbf{A}_{ij}$ is the covariance between the observation locations:

$$\mathbf{A}_{ij}= \langle \mathbf{x}_{obs}, \mathbf{x}'_{obs} \rangle + \langle \boldsymbol{\epsilon}, \boldsymbol{\epsilon}' \rangle$$

and $\mathbf{C}_{\mathbf{x}j}$ is the cross-covariance between the locations to be estimated and the observation locations:

$$\mathbf{C}_{\mathbf{x}j} = \langle \mathbf{x}^*, \mathbf{x}_{obs} \rangle$$

Reformulation

We can write the full equation (in vector form) as follows:

$$\boldsymbol{f}(\mathbf{x}^*) = \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}^*) (\mathbf{K_{XX}} + \sigma^2\mathbf{I})^{-1}\mathbf{Y}_{obs}$$

where:

  • $\mathbf{K_{XX}}$ - the kernel matrix over the coordinates where we have observations, $\mathbf{Y}_{obs} \in \mathbb{R}^{N \times D}$.
  • $\boldsymbol{k}_{\mathbf{X}}(\cdot)$ - the cross-kernel between the new coordinates, $\mathbf{x}^* \in \mathbb{R}^{D}$, and the observed coordinates, $\mathbf{X} \in \mathbb{R}^{N\times D}$.

We have a kernel function which gives the correlation between the observation locations, $\mathbf{x}_{obs}$:

$$\mathbf{K}_{ij}= \langle \mathbf{x}_{obs}, \mathbf{x}'_{obs} \rangle$$

where $\mathbf{X}_{obs} \in \mathbb{R}^{N \times D}$.

If we group the terms that depend on the observations, $\mathbf{Y}_{obs}$, we can call the result $\boldsymbol{\alpha}$:

$$\boldsymbol{\alpha} = (\mathbf{K_{XX}} + \sigma^2\mathbf{I})^{-1}\mathbf{Y}_{obs}$$

where $\boldsymbol{\alpha} \in \mathbb{R}^{N \times D}$. We can solve for $\boldsymbol{\alpha}$ exactly. To find the best free parameters, $\boldsymbol{\theta} = \{ \sigma^2, \boldsymbol{\alpha} \}$, we can use cross-validation.

Note: This comes from kernel methods!


And now we can make predictions

$$\boldsymbol{m}(\mathbf{x}^*) = \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}^*) \boldsymbol{\alpha}$$

and we can also estimate the covariance.

$$\boldsymbol{\Sigma}^2(\mathbf{x}^*, \mathbf{x}'^*) = \sigma^2 + \boldsymbol{k}(\mathbf{x}^*, \mathbf{x}'^*) - \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}^*) \left(\mathbf{K_{XX}} + \sigma^2\mathbf{I}\right)^{-1} \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}'^*)^\top$$
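Putting the pieces together, here is a minimal sketch of the exact kernel solution above; an RBF kernel and synthetic 1-D coordinates are assumed (in practice the coordinates would be lon/lat/time):

```python
import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    # k(x, y) = exp(-gamma * ||x - y||^2), evaluated pairwise over rows.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(3)
N = 50
X_obs = rng.uniform(0, 1, size=(N, 1))                  # observed coordinates
Y_obs = np.sin(2 * np.pi * X_obs) + 0.1 * rng.normal(size=(N, 1))
sigma2 = 0.1 ** 2                                       # noise variance

# alpha = (K_XX + sigma^2 I)^{-1} Y_obs
K_XX = rbf_kernel(X_obs, X_obs)
alpha = np.linalg.solve(K_XX + sigma2 * np.eye(N), Y_obs)

# Predictions at new coordinates (the gaps to fill).
X_new = np.linspace(0, 1, 200)[:, None]
k_X = rbf_kernel(X_new, X_obs)

mean = k_X @ alpha                                      # predictive mean m(x*)
cov = (sigma2 * np.eye(len(X_new)) + rbf_kernel(X_new, X_new)
       - k_X @ np.linalg.solve(K_XX + sigma2 * np.eye(N), k_X.T))
```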

Notation

  • $\boldsymbol{k}(\mathbf{x}, \mathbf{y}): \mathcal{X}\times \mathcal{Y} \rightarrow \mathbb{R}$ - kernel function that takes in two vectors and returns a scalar value.
  • $\boldsymbol{k}(\mathbf{x}_*): \mathcal{X} \rightarrow \mathbb{R}^{N}$ - cross-kernel vector between a new point, $\mathbf{x}_*$, and the $N$ observed points.
  • $\mathbf{K_{XX}} \in \mathbb{R}^{N_{\mathbf{x}} \times N_\mathbf{x}}$ - kernel matrix (Gram matrix) obtained by evaluating the kernel function on all pairs of points in $\mathbf{X}$.
  • $\mathbf{K_{XY}} \in \mathbb{R}^{N_{\mathbf{x}} \times N_\mathbf{y}}$ - cross-kernel matrix between the points in $\mathbf{X}$ and the points in $\mathbf{Y}$.

Kernel Functions

Linear Kernel

$$\boldsymbol{k}(\mathbf{x},\mathbf{y}) = \mathbf{x}^\top \mathbf{y}$$

Gaussian Kernel (aka Radial Basis Function; the $\nu \to \infty$ limit of the Matérn family)

$$\boldsymbol{k}(\mathbf{x},\mathbf{y}) = \exp \left( - \gamma\,||\mathbf{x} - \mathbf{y}||_2^2 \right)$$

Note: This is known as the Gaussian kernel if $\gamma = \frac{1}{\sigma^2}$.

Laplacian Kernel

$$\boldsymbol{k}(\mathbf{x},\mathbf{y}) = \exp \left( - \gamma\,||\mathbf{x} - \mathbf{y}||_1 \right)$$

Polynomial Kernel

$$\boldsymbol{k}(\mathbf{x},\mathbf{y}) = \left( \gamma\,\mathbf{x}^\top \mathbf{y} + c_0 \right)^d$$
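Minimal NumPy sketches of these kernel functions, evaluated pairwise over the rows of two coordinate arrays (parameter defaults are arbitrary):

```python
import numpy as np

def linear_kernel(X, Y):
    # k(x, y) = x^T y
    return X @ Y.T

def rbf_kernel(X, Y, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||_2^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def laplacian_kernel(X, Y, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||_1)
    d1 = np.abs(X[:, None, :] - Y[None, :, :]).sum(-1)
    return np.exp(-gamma * d1)

def polynomial_kernel(X, Y, gamma=1.0, c0=1.0, degree=3):
    # k(x, y) = (gamma * x^T y + c0)^d
    return (gamma * X @ Y.T + c0) ** degree

X = np.random.randn(5, 2)
Y = np.random.randn(4, 2)
print(rbf_kernel(X, Y).shape)  # (5, 4)
```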

Scaling

The bottleneck in the above methods is the inversion of the $\mathbf{K_{XX}}$ matrix, which is of order $\mathcal{O}(N^3)$.

We can assume that our kernel matrix $\mathbf{K_{XX}}$ can be approximated via a low-rank matrix:

$$\mathbf{K_{XX}} \approx \mathbf{K_{XZ}}\mathbf{K_{XZ}}^\top$$

where $\mathbf{K_{XZ}} \in \mathbb{R}^{N \times r}$.

Similarly, we can also write:

$$\mathbf{K_{XX}} \approx \mathbf{K_{XZ}}\mathbf{K_{ZZ}}^{-1}\mathbf{K_{XZ}}^\top$$

where $\mathbf{Z} \in \mathbb{R}^{r \times D}$ is a set of $r$ inducing points, with the corresponding reduced cross-kernel

$$\boldsymbol{k}_r(\mathbf{x}) := \boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}) \in \mathbb{R}^{r}$$

So then we can rewrite our solution to be:

$$\boldsymbol{\alpha} = \left( \mathbf{K_{XZ}}^\top \mathbf{K_{XZ}} + \sigma^2 \mathbf{K_{ZZ}} \right)^{\dagger} \mathbf{K_{XZ}}^\top \mathbf{Y}_{obs}$$

Now to make predictions, we can simply write:

$$\boldsymbol{m}(\mathbf{x}^*) = \boldsymbol{k}_{r}(\mathbf{x}^*)\boldsymbol{\alpha}$$

This is much cheaper:

$$\mathcal{O}(Nr^2 + r^3) < \mathcal{O}(N^3)$$

Predictive Mean

$$\boldsymbol{m}(\mathbf{x}^*) = \boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}^*)\boldsymbol{\alpha}$$

Predictive Covariance

$$\boldsymbol{\Sigma}^2(\mathbf{x}^*, \mathbf{x}'^*) = \boldsymbol{k}(\mathbf{x}^*, \mathbf{x}'^*) - \boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}^*)^\top \mathbf{K_{ZZ}}^{-1}\boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}'^*) + \boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}^*)^\top \left( \mathbf{K_{ZZ}} + \sigma^{-2} \mathbf{K_{ZX}}\mathbf{K_{XZ}} \right)^{-1} \boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}'^*)$$

where:

  • $\mathbf{K_{XX}} \in \mathbb{R}^{N \times N}$
  • $\mathbf{K_{ZZ}} \in \mathbb{R}^{r \times r}$
  • $\mathbf{K_{ZX}} \in \mathbb{R}^{r \times N}$
  • $\mathbf{K_{XZ}} \in \mathbb{R}^{N \times r}$
  • $\boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}) \in \mathbb{R}^{r}$
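A minimal sketch of this sparse approach, with the inducing points taken as a random subset of the observed coordinates (all data and parameters below are placeholders):

```python
import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    # Pairwise RBF kernel between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(4)
N, r = 2000, 50
X_obs = rng.uniform(0, 1, size=(N, 1))
Y_obs = np.sin(2 * np.pi * X_obs) + 0.1 * rng.normal(size=(N, 1))
sigma2 = 0.1 ** 2

# Inducing points Z: here simply a random subset of the observed coordinates.
Z = X_obs[rng.choice(N, size=r, replace=False)]

K_XZ = rbf_kernel(X_obs, Z)   # (N, r)
K_ZZ = rbf_kernel(Z, Z)       # (r, r)

# alpha = (K_XZ^T K_XZ + sigma^2 K_ZZ)^+ K_XZ^T Y_obs  -- an (r x r) system, not (N x N).
A = K_XZ.T @ K_XZ + sigma2 * K_ZZ
alpha = np.linalg.lstsq(A, K_XZ.T @ Y_obs, rcond=None)[0]

# Predictive mean: m(x*) = k_Z(x*) alpha
X_new = np.linspace(0, 1, 200)[:, None]
mean = rbf_kernel(X_new, Z) @ alpha
```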

Resources:

  • Less is More: Nyström Computational Regularization
  • Using Nystrom to Speed Up Kernel Machines

  • Linear -> Nystrom Approximation
  • Polynomial -> Nystrom / TensorSketch
  • Gaussian -> Nystrom / RFF

Potential Projects

  • Kernel Methods - solve exactly with cross validation for the best parameters
  • Gaussian Processes - solve with maximum likelihood
  • Kernel Functions - explore different kernel functions and see how they work (e.g. linear, Gaussian, Laplacian, Haversine, Matern Family)
  • Sparse Methods (Kernel) - use kernel approximation schemes (e.g. Nystrom, RFF)
  • Sparse GP Methods - use inducing points to see how they work.

Model

We use a non-parametric data model: the fit is built from many local contributions, and the effective number of parameters can grow with, and even exceed, the number of data points.

$$\mathbf{y}_i = \boldsymbol{f}(\mathbf{x}_i) + \boldsymbol{\epsilon}_i, \hspace{10mm} i=1,2, \ldots, N$$

where:

  • $\boldsymbol{f}(\cdot)$ - the regression function
  • $\mathbf{x}_i$ - the sampling position, e.g. latitude, longitude, and/or time
  • $\mathbf{y}_i$ - the sample
  • $\boldsymbol{\epsilon}_i$ - zero-mean, i.i.d. noise

Kernel Methods

  • Connections and Equivalences between the Nyström Method and Sparse Variational Gaussian Processes - Arxiv
  • Gaussian Processes for Machine Learning - PDF
  • PyKrige
  • Spatial Analysis Made Easy with Linear Regression and Kernels - PDF
  • RFFs - blog
  • GP Regression - blog | blog
  • An exact kernel framework for spatio-temporal dynamics - arxiv
  • 4DVarNet - Broadcast | Loops | Batch
  • Spatio-Temporal VGPs (Spatio-Temporal Variational Gaussian Processes) - arxiv
  • Space-Time Kernels - prezi
  • Falkon - python
  • Kernel methods & sparse methods for computer vision - bach - prezi
  • Kernel methods through the roof: handling billions of points efficiently - arxiv
  • KRR Demo - sklearn

Theory

  • Machine Learning with Kernel Methods - Slides
  • Kernel Ridge Regression Course - Class

References

  • Gaussian Process Machine Learning and Kriging for Groundwater Salinity Interpolation - Cui et al (2021) - Paper

They use different kernels with different model outputs to try and interpolate. They found that more expressive kernels with combinations of inputs provided better models.

Learning

  • How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets - Lu et al (2015)
  • On the Benefits of Large Learning Rates for Kernel Methods - Beugnot et al (2022)
  • Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems - Nakkiran et al (2020)

Motivation

  • Demo for Scalable Kernel Features w/ Sklearn - Blog

Software Planning


RFF


Scalable Computation

  • Kernel methods through the roof: handling billions of points efficiently - Meanti et al (2020) - arxiv | video
  • Efficient Hyperparameter Tuning for Large Scale Kernel Ridge Regression - Meanti et al (2022) - arxiv
  • Scalable Kernel Methods via Doubly Stochastic Gradients - Dai et al (2015)
  • Making Large-Scale Nyström Approximation Possible - Li et al (2010) | Blog
  • Large-Scale Nyström Kernel Matrix Approximation Using Randomized SVD - Li et al (2014)
  • Practical Algorithms for Latent Variable Models - Gundersen (Thesis) (2021) - thesis


Spatio-Temporal Kernels

  • Kernel Regression for Image Processing and Reconstruction - Takeda et al (2007) - pdf
  • Composite Kernels for HSI classification - Gustau (2006) - Paper

Connections

  • Deep Gaussian Markov Random Fields - Siden & Lindsten (2020) - arxiv

Markovian Gaussian Processes

  • Spatio-Temporal Variational Gaussian Processes - Hamelijnick et al (2021) - arxiv (JAX)
  • Variational Gaussian Process State-Space Models - Frigola et al (2014) - arxiv

Periodic Activation Functions

  • NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis - Mildenhall et al (2020)
  • Implicit Neural Representations with Periodic Activation Functions - Sitzmann et al (2020) - arxiv | page | eccv vid | blog

Kernel Functions