
Optimal Interpolation

CNRS
MEOM

Optimal interpolation aims at finding the Best Linear Unbiased Estimator (BLUE).

Problem Setting

$$\mathbf{x} = \left[\text{lon}_1, \ldots, \text{lon}_{D_\text{lon}}, \text{lat}_1, \ldots, \text{lat}_{D_\text{lat}}, \text{time}_1, \ldots, \text{time}_{D_\text{time}} \right] \in \mathbb{R}^{D_\text{lon} \times D_\text{lat} \times D_\text{time}}$$

For example, if we have a spatial lat-lon grid of 30×30 points and 30 time steps, then the state vector is 30×30×30, i.e. a 27,000-dimensional vector!
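To make the dimensionality concrete, here is a minimal NumPy sketch (the grid extents are made up for illustration) that flattens a hypothetical 30×30×30 lat-lon-time grid into one long coordinate array:

```python
import numpy as np

# Hypothetical 30 x 30 spatial grid over 30 time steps (values are illustrative).
lon = np.linspace(-65.0, -55.0, 30)
lat = np.linspace(33.0, 43.0, 30)
time = np.arange(30.0)

# Every (lon, lat, time) coordinate of the state grid, flattened into one array.
LON, LAT, TIME = np.meshgrid(lon, lat, time, indexing="ij")
coords = np.stack([LON.ravel(), LAT.ravel(), TIME.ravel()], axis=-1)

print(coords.shape)  # (27000, 3) -> a 27,000-point state grid
```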


Optimal Interpolation (OI) is widely regarded as the gold standard for gap-filling SSH fields with missing data, a situation that is very common for satellite sensors. The standard OI update equations are:

$$\begin{aligned}
\tilde{\boldsymbol{\mu}}_{\mathbf{x}_{t}} &= \mathbf{x}_t + \mathbf{K}(\mathbf{y}_t - \mathbf{H}\mathbf{x}_t) \\
\tilde{\boldsymbol{\Sigma}}_{\mathbf{x}_{t}} &= (\mathbf{I} - \mathbf{KH})\boldsymbol{\Sigma}_\mathbf{x} \\
\mathbf{K} &= \mathbf{P} \mathbf{H}^\top\left(\mathbf{H} \mathbf{P} \mathbf{H}^\top + \mathbf{R} \right)^{-1}
\end{aligned}$$

The matrix $\mathbf{K}$ is known as the Kalman gain. The optimal $\mathbf{K}$ is given by:

$$\mathbf{K} = \boldsymbol{\Sigma}_{\mathbf{xy}} \boldsymbol{\Sigma}_{\mathbf{yy}}^{-1}$$

where $\boldsymbol{\Sigma}_{\mathbf{yy}}$ is the covariance of the observations and $\boldsymbol{\Sigma}_{\mathbf{xy}}$ is the cross-covariance between the state and the observations.

Notice how this assumes that interpolation can be done via a linear operator, $\mathbf{H}$. The Kalman gain, $\mathbf{K}$, represents the best interpolation under this linear assumption. However, this is often insufficient in scenarios where the data does not exhibit linear dynamics.
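Before moving on, here is a minimal NumPy sketch of the standard (linear) analysis step above; the operator `H`, the covariances `P` and `R`, and the toy values are placeholders, not part of any particular OI implementation:

```python
import numpy as np

def oi_analysis(x_b, y, H, P, R):
    """One OI / Kalman analysis step with a linear observation operator H.

    x_b : (D,)   background (prior) state
    y   : (M,)   observations
    H   : (M, D) linear observation operator
    P   : (D, D) background covariance
    R   : (M, M) observation-noise covariance
    """
    S = H @ P @ H.T + R                      # innovation covariance
    K = np.linalg.solve(S, H @ P).T          # Kalman gain K = P H^T S^{-1}
    mu = x_b + K @ (y - H @ x_b)             # analysis mean
    Sigma = (np.eye(len(x_b)) - K @ H) @ P   # analysis covariance
    return mu, Sigma

# Toy usage with random placeholder matrices.
rng = np.random.default_rng(0)
D, M = 50, 10
H = rng.standard_normal((M, D))
P, R = np.eye(D), 0.1 * np.eye(M)
mu, Sigma = oi_analysis(np.zeros(D), rng.standard_normal(M), H, P, R)
```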

Data-Driven OI

So in practice, it is often not possible to specify these linear operators and covariances a priori; instead, we can try to learn them directly from the data.

Notation

| Symbol | Meaning |
| --- | --- |
| $x \in \mathbb{R}$ | scalar |
| $\mathbf{x} \in \mathbb{R}^{D}$ | vector |
| $\mathbf{X} \in \mathbb{R}^{N \times D}$ | a matrix |
| $\mathbf{K}_{\mathbf{XX}} := \mathbf{K} \in \mathbb{R}^{N \times N}$ | the kernel matrix |
| $\boldsymbol{k}(\mathbf{X}, \mathbf{x}) = \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}) \in \mathbb{R}^{N}$ | the cross-kernel vector for some data, $\mathbf{X} \in \mathbb{R}^{N \times D}$, and some new vector, $\mathbf{x}$ |

Setting

$$\begin{aligned}
\boldsymbol{x}(t) &= \mathcal{M}\left( \boldsymbol{x}(t-1) \right) + \eta(t) \\
\boldsymbol{y}(t) &= \mathcal{H}(\boldsymbol{x}(t)) + \boldsymbol{\epsilon}(t)
\end{aligned}$$

Discrete Setting

$$\begin{aligned}
\mathbf{x}_{k} &= \mathbf{x}_{k-1} + \boldsymbol{\epsilon}_{\mathbf{x}} \\
\mathbf{y}_{k} &= \mathbf{H}\mathbf{x}_{k} + \boldsymbol{\epsilon}_{\mathbf{y}}
\end{aligned}$$

where $\mathbf{H}$ is a linear observation operator and $\boldsymbol{\epsilon}_{\mathbf{x}}, \boldsymbol{\epsilon}_{\mathbf{y}}$ are the state and observation noise terms.
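To make this concrete, here is a small NumPy sketch that simulates the discrete model above (a random-walk state observed through a linear operator); the noise scales and the random `H` are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, T = 20, 5, 100                    # state dim, obs dim, number of steps
H = rng.standard_normal((M, D))         # placeholder linear observation operator

x = np.zeros(D)
states, obs = [], []
for _ in range(T):
    x = x + 0.1 * rng.standard_normal(D)          # x_k = x_{k-1} + eps_x
    y = H @ x + 0.05 * rng.standard_normal(M)     # y_k = H x_k + eps_y
    states.append(x.copy())
    obs.append(y)
states, obs = np.asarray(states), np.asarray(obs)
```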


O.I. Equations

$$\begin{aligned}
\mathbf{K} &= \mathbf{P} \mathbf{H}^\top\left(\mathbf{H} \mathbf{P} \mathbf{H}^\top + \mathbf{R} \right)^{-1} \\
\tilde{\boldsymbol{\mu}}_{\mathbf{x}_{t}} &= \mathbf{x}_t + \mathbf{K}(\mathbf{y}_t - \mathbf{H}\mathbf{x}_t) \\
\tilde{\boldsymbol{\Sigma}}_{\mathbf{x}_{t}} &= (\mathbf{I} - \mathbf{KH})\boldsymbol{\Sigma}_\mathbf{x}
\end{aligned}$$

where $\mathbf{P}$ is the background (state) covariance, $\mathbf{R}$ is the observation-noise covariance, and $\mathbf{H}$ is the linear observation operator.

Cost Function

Lorenc (1986) showed that OI is closely related to the 3D-Var variational data assimilation cost function:

$$\mathcal{L}(\boldsymbol{\theta}) = (\mathbf{x} - \mathbf{x}^b)^\top \mathbf{B}^{-1} (\mathbf{x} - \mathbf{x}^b) + (\mathbf{y} - \mathbf{H}\mathbf{x})^\top \mathbf{R}^{-1} (\mathbf{y} - \mathbf{H}\mathbf{x})$$

where $\boldsymbol{\theta} = \{ \mathbf{B}, \mathbf{R} \}$ are the background and observation error covariances.
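For illustration, here is a minimal sketch of this cost function in NumPy; minimizing it over $\mathbf{x}$ (e.g. with a gradient-based optimizer) recovers the OI analysis mean for the same $\mathbf{B}$, $\mathbf{R}$, and $\mathbf{H}$. All array names are placeholders.

```python
import numpy as np

def cost_3dvar(x, x_b, y, H, B_inv, R_inv):
    """3D-Var cost: background misfit plus observation misfit."""
    dx = x - x_b       # departure from the background state
    dy = y - H @ x     # departure from the observations
    return dx @ B_inv @ dx + dy @ R_inv @ dy
```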


In Practice

It does not make sense to assume an affine linear transformation if our model cannot actually be described by one. Instead, we can estimate this transformation directly from the data. Let's take some observed coordinate-value pairs, $\{\mathbf{X}_{obs}, \mathbf{Y}_{obs} \} = \{\mathbf{x}_{obs}, \mathbf{y}_{obs} \}_{i=1}^N$. We will learn a model that best fits the observations, and then use it to interpolate and fill in the gaps.

Historical Literature:

Modern Literature:


We can write an exact equation for the best least-squares linear estimator, $\boldsymbol{f}(\mathbf{x}^*)$:

$$\boldsymbol{f}(\mathbf{x}^*) = \boldsymbol{\mu}_\mathbf{x} + \sum_{i=1}^{D_\mathbf{x}} \sum_{j=1}^{D_\mathbf{y}} \mathbf{A}_{ij}^{-1}\mathbf{C}_{\mathbf{x}j}(\mathbf{y}_i - \boldsymbol{\mu}_{\mathbf{x}j})$$

where $\mathbf{A}_{ij}$ is the covariance between the observation locations:

$$\mathbf{A}_{ij} = \langle \mathbf{x}_{obs}, \mathbf{x}'_{obs} \rangle + \langle \boldsymbol{\epsilon}, \boldsymbol{\epsilon}' \rangle$$

and $\mathbf{C}_{\mathbf{x}j}$ is the cross-covariance between the locations to be estimated and the observation locations:

$$\mathbf{C}_{\mathbf{x}j} = \langle \mathbf{x}^*, \mathbf{x}_{obs} \rangle$$

Reformulation

We can write the full equation (in vector form) as follows:

$$\boldsymbol{f}(\mathbf{x}^*) = \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}^*) \left(\mathbf{K}_{\mathbf{XX}} + \sigma^2\mathbf{I}\right)^{-1}\mathbf{Y}_{obs}$$

where we have a kernel function which gives the correlation between the observation locations, $\mathbf{x}$:

$$\mathbf{K}_{ij} = \langle \mathbf{x}_{obs}, \mathbf{x}'_{obs} \rangle$$

where $\mathbf{X}_{obs} \in \mathbb{R}^{N \times D}$.

If we look only at the part that depends on the observations, $\mathbf{Y}_{obs}$, we can call it $\boldsymbol{\alpha}$.

$$\boldsymbol{\alpha} = \left(\mathbf{K}_{\mathbf{XX}} + \sigma^2\mathbf{I}\right)^{-1}\mathbf{Y}_{obs}$$

where $\boldsymbol{\alpha} \in \mathbb{R}^{N \times D}$. We can solve for $\boldsymbol{\alpha}$ exactly. To find the best free parameters, $\boldsymbol{\theta} = \{ \sigma^2, \boldsymbol{\alpha} \}$, we can use cross-validation.

Note: This comes from kernel methods!


And now we can make predictions:

$$\boldsymbol{m}(\mathbf{x}^*) = \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}^*) \boldsymbol{\alpha}$$

and we can also estimate the covariance.

$$\boldsymbol{\Sigma}^2(\mathbf{x}^*, \mathbf{x}'^*) = \sigma^2 + \boldsymbol{k}(\mathbf{x}^*, \mathbf{x}'^*) - \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}^*) \left(\mathbf{K}_{\mathbf{XX}} + \sigma^2\mathbf{I}\right)^{-1} \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}'^*)^\top$$
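A minimal NumPy sketch of this kernel formulation, assuming an RBF kernel and a toy 1D dataset (the kernel choice, hyperparameters, and data are all illustrative):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||_2^2) evaluated for all pairs."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def fit_alpha(X_obs, Y_obs, noise=0.1, gamma=0.5):
    """alpha = (K_XX + sigma^2 I)^{-1} Y_obs, with noise playing the role of sigma^2."""
    K = rbf_kernel(X_obs, X_obs, gamma)
    return np.linalg.solve(K + noise * np.eye(len(X_obs)), Y_obs)

def predict(X_new, X_obs, alpha, noise=0.1, gamma=0.5):
    """Predictive mean m(x*) = k_X(x*) alpha and the predictive covariance."""
    K_star = rbf_kernel(X_new, X_obs, gamma)
    mean = K_star @ alpha
    K = rbf_kernel(X_obs, X_obs, gamma)
    K_ss = rbf_kernel(X_new, X_new, gamma)
    cov = noise * np.eye(len(X_new)) + K_ss - K_star @ np.linalg.solve(
        K + noise * np.eye(len(X_obs)), K_star.T
    )
    return mean, cov

# Toy 1D example with made-up observations.
rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 10, size=(30, 1))
Y_obs = np.sin(X_obs[:, 0]) + 0.1 * rng.standard_normal(30)
alpha = fit_alpha(X_obs, Y_obs)
mean, cov = predict(np.linspace(0, 10, 100)[:, None], X_obs, alpha)
```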

Notation


Kernel Functions

Linear Kernel

k(x,y)=xy\boldsymbol{k}(\mathbf{x},\mathbf{y}) = \mathbf{x}^\top \mathbf{y}

Gaussian Kernel (aka Radial Basis Function; the $\nu \to \infty$ limit of the Matérn family)

$$\boldsymbol{k}(\mathbf{x},\mathbf{y}) = \exp \left( - \gamma\,\|\mathbf{x} - \mathbf{y}\|_2^2 \right)$$

Note: this is known as the Gaussian kernel if $\gamma = \frac{1}{\sigma^2}$.

Laplacian Kernel

$$\boldsymbol{k}(\mathbf{x},\mathbf{y}) = \exp \left( - \gamma\,\|\mathbf{x} - \mathbf{y}\|_1 \right)$$

Polynomial Kernel

$$\boldsymbol{k}(\mathbf{x},\mathbf{y}) = \left( \gamma\,\mathbf{x}^\top \mathbf{y} + c_0 \right)^d$$
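The kernels above can be written in a few lines of NumPy (a sketch; the hyperparameter defaults are arbitrary):

```python
import numpy as np

def linear_kernel(X, Y):
    return X @ Y.T

def rbf_kernel(X, Y, gamma=1.0):
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared L2 distances
    return np.exp(-gamma * sq_dists)

def laplacian_kernel(X, Y, gamma=1.0):
    l1_dists = np.abs(X[:, None, :] - Y[None, :, :]).sum(-1)   # L1 distances
    return np.exp(-gamma * l1_dists)

def polynomial_kernel(X, Y, gamma=1.0, c0=1.0, degree=3):
    return (gamma * (X @ Y.T) + c0) ** degree
```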

Scaling

The bottleneck in the above methods is the inversion of the $\mathbf{K}_{\mathbf{XX}}$ matrix, which costs $\mathcal{O}(N^3)$.

We can assume that our kernel matrix $\mathbf{K}_{\mathbf{XX}}$ can be approximated by a low-rank matrix:

$$\mathbf{K}_{\mathbf{XX}} \approx \mathbf{K}_{\mathbf{XZ}}\mathbf{K}_{\mathbf{XZ}}^\top$$

where $\mathbf{K}_{\mathbf{XZ}} \in \mathbb{R}^{N \times r}$.

Similarly, we can also write:

$$\mathbf{K}_{\mathbf{XX}} \approx \mathbf{K}_{\mathbf{XZ}}\mathbf{K}_{\mathbf{ZZ}}^{-1}\mathbf{K}_{\mathbf{XZ}}^\top$$

where the reduced cross-kernel is $\boldsymbol{k}_r(\mathbf{X}) = \mathbf{K}_{\mathbf{XZ}}$, i.e. $\boldsymbol{k}_r(\mathbf{x}^*) = \boldsymbol{k}(\mathbf{x}^*, \mathbf{Z}) \in \mathbb{R}^{r}$.

Then we can rewrite our solution as:

$$\boldsymbol{\alpha} = \left( \mathbf{K}_{\mathbf{XZ}}^\top \mathbf{K}_{\mathbf{XZ}} + \sigma^2 \mathbf{K}_{\mathbf{ZZ}} \right)^{\dagger} \mathbf{K}_{\mathbf{XZ}}^\top \mathbf{Y}_{obs}$$

Now to make predictions, we can simply write:

m(x)=kr(x)α\boldsymbol{m}(\mathbf{x}^*) = \boldsymbol{k}_{r}(\mathbf{x}^*)\boldsymbol{\alpha}

This is much cheaper:

$$\mathcal{O}(Nr^2 + r^3) < \mathcal{O}(N^3)$$

Predictive Mean

$$\boldsymbol{m}(\mathbf{x}^*) = \boldsymbol{k}_{\mathbf{z}}(\mathbf{x}^*)\,\boldsymbol{\alpha}$$

Predictive Covariance

$$\boldsymbol{\Sigma}^2(\mathbf{x}^*, \mathbf{x}'^*) = \boldsymbol{k}(\mathbf{x}^*, \mathbf{x}'^*) - \boldsymbol{k}_{\mathbf{z}}(\mathbf{x}^*)^\top \mathbf{K}_{\mathbf{ZZ}}^{-1}\boldsymbol{k}_{\mathbf{z}}(\mathbf{x}'^*) + \boldsymbol{k}_{\mathbf{z}}(\mathbf{x}^*)^\top \left( \mathbf{K}_{\mathbf{ZZ}} + \sigma^{-2} \mathbf{K}_{\mathbf{ZX}}\mathbf{K}_{\mathbf{XZ}} \right)^{-1} \boldsymbol{k}_{\mathbf{z}}(\mathbf{x}'^*)$$

where $\mathbf{Z} \in \mathbb{R}^{r \times D}$ is a set of $r$ reference (inducing) locations, $\mathbf{K}_{\mathbf{ZZ}} \in \mathbb{R}^{r \times r}$ is the kernel matrix between those locations, and $\boldsymbol{k}_{\mathbf{z}}(\mathbf{x}^*) \in \mathbb{R}^{r}$ is the cross-kernel between a prediction location and the inducing locations.
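A minimal sketch of the low-rank solution above, assuming an RBF kernel and inducing locations $\mathbf{Z}$ chosen as a random subset of the observations (all names and values are illustrative):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    return np.exp(-gamma * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))

def fit_low_rank(X_obs, Y_obs, Z, noise=0.1, gamma=0.5):
    """alpha = (K_XZ^T K_XZ + sigma^2 K_ZZ)^+ K_XZ^T Y_obs."""
    K_xz = rbf_kernel(X_obs, Z, gamma)
    K_zz = rbf_kernel(Z, Z, gamma)
    A = K_xz.T @ K_xz + noise * K_zz                            # r x r system instead of N x N
    alpha, *_ = np.linalg.lstsq(A, K_xz.T @ Y_obs, rcond=None)  # pseudo-inverse solve
    return alpha

def predict_low_rank(X_new, Z, alpha, gamma=0.5):
    """m(x*) = k_z(x*) alpha."""
    return rbf_kernel(X_new, Z, gamma) @ alpha

# Toy usage: r = 20 inducing locations from N = 2000 observations.
rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 10, size=(2000, 1))
Y_obs = np.sin(X_obs[:, 0]) + 0.1 * rng.standard_normal(2000)
Z = X_obs[rng.choice(len(X_obs), size=20, replace=False)]
alpha = fit_low_rank(X_obs, Y_obs, Z)
mean = predict_low_rank(np.linspace(0, 10, 100)[:, None], Z, alpha)
```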

Resources:



Potential Projects


Model

We have a data model that is non-parametric, meaning we fit a model built from many local fits, where the number of parameters may grow with (and even exceed) the number of data points.

$$\mathbf{y}_i = \boldsymbol{f}(\mathbf{x}_i) + \boldsymbol{\epsilon}_i, \qquad i=1,2,\ldots,N$$

where $\boldsymbol{f}$ is an unknown function to be estimated and $\boldsymbol{\epsilon}_i$ is a noise term.


Kernel Methods


Theory


References

They use different kernels with different model outputs to try to interpolate. They found that more expressive kernels with combinations of inputs provided better models.

Learning


Motivation


Software Planning


RFF


Scalable Computation



Spatio-Temporal Kernels


Connections


Markovian Gaussian Processes


Periodic Activation Functions


Kernel Functions