Optimal Interpolation#
This aims at finding the Best Linear Unbiased Estimator (BLUE).
Problem Setting#
For example, if we have a spatial lat-lon grid of 30x30 points and 30 time steps, then the flattened state vector is 30x30x30 = 27,000-dimensional!
Optimal Interpolation (OI) seems to be the gold standard for gap-filling SSH fields with missing data, a situation that is very prevalent for satellite sensors. However, the standard OI method assumes linear dynamics and becomes very expensive at the dimensions described above.
The \(\mathbf{K}\) is known as the Kalman gain. The optimal \(\mathbf{K}\) is given by:

\[
\mathbf{K} = \boldsymbol{\Sigma}_{\mathbf{xy}}\boldsymbol{\Sigma}_{\mathbf{yy}}^{-1}
\]

where \(\boldsymbol{\Sigma}_{\mathbf{yy}}\) is the covariance of the observations and \(\boldsymbol{\Sigma}_{\mathbf{xy}}\) is the cross-covariance between the state and the observations.
Notice how this assumes that interpolation can be done via a linear operator, \(\mathbf{H}\). The Kalman gain, \(\mathbf{K}\), represents the best interpolation via this linear problem. However, this is often insufficient in many scenarios where the data does not exhibit linear dynamics.
Data-Driven OI#
So in practice, it is often not possible to specify the prior and observation covariances by hand; instead, we would like to estimate the interpolator directly from the data.
Notation#
| Symbol | Meaning |
|---|---|
| \(x \in \mathbb{R}\) | scalar |
| \(\mathbf{x} \in \mathbb{R}^{D}\) | vector |
| \(\mathbf{X} \in \mathbb{R}^{N \times D}\) | a matrix |
| \(\mathbf{K}_{\mathbf{XX}} := \mathbf{K} \in \mathbb{R}^{N \times N}\) | the kernel matrix |
| \(\boldsymbol{k}(\mathbf{X}, \mathbf{x}) = \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}) \in \mathbb{R}^{N}\) | the cross kernel vector for some data, \(\mathbf{X} \in \mathbb{R}^{N \times D}\), and some new vector, \(\mathbf{x}\) |
Setting#
Discrete Setting

We treat the prior as a noisy estimate of the true state, \(\mathbf{x}_{true}\), and the observations as a noisy linear transformation of it:

\[
\mathbf{x} = \mathbf{x}_{true} + \boldsymbol{\epsilon}_{\mathbf{x}}, \qquad
\mathbf{y} = \mathbf{H}\mathbf{x}_{true} + \boldsymbol{\epsilon}_{\mathbf{y}}
\]

where:

- \(\boldsymbol{\epsilon}_{\mathbf{x}} \sim \mathcal{N}(\mathbf{0}, \mathbf{P})\) - prior (state) error; this covariance is written \(\mathbf{B}\) in the OI equations below
- \(\boldsymbol{\epsilon}_{\mathbf{y}} \sim \mathcal{N}(\mathbf{0}, \mathbf{R})\) - observation error
O.I. Equations#
\[
\hat{\mathbf{x}} = \mathbf{x} + \mathbf{K}\left(\mathbf{y} - \mathbf{H}\mathbf{x}\right), \qquad
\mathbf{K} = \mathbf{B}\mathbf{H}^\top\left(\mathbf{H}\mathbf{B}\mathbf{H}^\top + \mathbf{R}\right)^{-1}
\]

where:

- \(\mathbf{y}\) - observations
- \(\mathbf{x}\) - prior input
- \(\mathbf{K}\) - Kalman gain
- \(\mathbf{B}\) - noise covariance matrix from the prior
- \(\mathbf{R}\) - noise covariance matrix from the observations.
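A minimal NumPy sketch of this update (the state size, covariances, and subsampling observation operator \(\mathbf{H}\) below are illustrative choices, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

D, N_obs = 50, 20                       # state dimension, number of observations
x_prior = np.zeros(D)                   # prior (background) state
H = np.eye(D)[rng.choice(D, N_obs, replace=False)]  # observe a random subset of grid points
y = rng.normal(size=N_obs)              # observations

# Prior and observation error covariances (illustrative: exponential decay and iid noise)
B = np.exp(-np.abs(np.subtract.outer(np.arange(D), np.arange(D))) / 5.0)
R = 0.1 * np.eye(N_obs)

# Kalman gain and OI analysis: x_hat = x + K (y - H x)
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
x_analysis = x_prior + K @ (y - H @ x_prior)
```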
Cost Function#
Lorenc (1986) showed that OI is closely related to the 3D-Var variational data assimilation formulation: the OI analysis minimizes the quadratic cost function

\[
\mathcal{J}(\mathbf{u}) = \left(\mathbf{u} - \mathbf{x}\right)^\top \mathbf{B}^{-1}\left(\mathbf{u} - \mathbf{x}\right) + \left(\mathbf{y} - \mathbf{H}\mathbf{u}\right)^\top \mathbf{R}^{-1}\left(\mathbf{y} - \mathbf{H}\mathbf{u}\right)
\]

where \(\boldsymbol{\theta} = \{ \mathbf{B}, \mathbf{R}\}\) are the free parameters.
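As a quick numerical check of this equivalence (continuing the variables from the NumPy sketch above, and assuming `scipy` is available), minimizing \(\mathcal{J}\) recovers the OI analysis:

```python
from scipy.optimize import minimize

# 3D-Var cost: background (prior) term + observation term
def cost(u):
    du = u - x_prior
    dy = y - H @ u
    return du @ np.linalg.solve(B, du) + dy @ np.linalg.solve(R, dy)

result = minimize(cost, x0=x_prior, method="L-BFGS-B")
print(np.allclose(result.x, x_analysis, atol=1e-4))  # expected: True
```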
In Practice#
It doesn't make sense to assume an affine linear transformation if our model cannot actually be described by one. Instead, we can try to learn the mapping directly from data. Let's take some observed coordinate-value pairs, \(\{\mathbf{X}_{obs}, \mathbf{Y}_{obs} \} = \{\mathbf{x}_{obs}^{(i)}, \mathbf{y}_{obs}^{(i)} \}_{i=1}^N\). We will learn a model that best fits these observations and then use it as an interpolator to fill in the gaps.
Historical Literature:
Bretherton et al. 1976
McIntosh, 1990
Le Traon et al. 1998
Modern Literature:
Kernel Methods
Kernel Interpolation
We can write an exact equation for the best least-squares linear estimator, \(\boldsymbol{f}(\mathbf{x})\):

\[
\boldsymbol{f}(\mathbf{x}) = \sum_{i=1}^{N}\sum_{j=1}^{N} \mathbf{C}_{\mathbf{x}i}\left(\mathbf{A}^{-1}\right)_{ij}\,\mathbf{y}_j
\]

where \(\mathbf{A}_{ij}\) is the covariance between the observation locations:

\[
\mathbf{A}_{ij} = \boldsymbol{C}(\mathbf{x}_i, \mathbf{x}_j)
\]

and \(\mathbf{C}_{\mathbf{x}j}\) is the cross covariance between the locations to be estimated and the observation locations:

\[
\mathbf{C}_{\mathbf{x}j} = \boldsymbol{C}(\mathbf{x}, \mathbf{x}_j)
\]
Reformulation#
We can write the full equation (in vector form) as follows:

\[
\boldsymbol{f}(\mathbf{x}^*) = \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}^*)^\top\left(\mathbf{K_{XX}} + \sigma^2\mathbf{I}_N\right)^{-1}\mathbf{Y}_{obs}
\]

where:

- \(\mathbf{K_{XX}}\) - the kernel matrix evaluated at the coordinates where we have observations, \(\mathbf{Y}_{obs} \in \mathbb{R}^{N \times D}\).
- \(\boldsymbol{k}_{\mathbf{X}}(\cdot)\) - the cross-kernel between the new coordinates, \(\mathbf{x}^* \in \mathbb{R}^{D}\), and the observed coordinates, \(\mathbf{X} \in \mathbb{R}^{N\times D}\).

The kernel function gives the correlation between the observation locations, \(\mathbf{x}\):

\[
\mathbf{K_{XX}} = \boldsymbol{k}(\mathbf{X}_{obs}, \mathbf{X}_{obs})
\]

where \(\mathbf{X}_{obs} \in \mathbb{R}^{N \times D}\).

If we collect the elements that depend only upon the observations, \(\mathbf{Y}_{obs}\), we can call this \(\boldsymbol{\alpha}\):

\[
\boldsymbol{\alpha} = \left(\mathbf{K_{XX}} + \sigma^2\mathbf{I}_N\right)^{-1}\mathbf{Y}_{obs}
\]

where \(\boldsymbol{\alpha} \in \mathbb{R}^{N \times D}\). We can solve for \(\boldsymbol{\alpha}\) exactly. To find the best free parameters, \(\boldsymbol{\theta} = \{ \sigma^2, \text{kernel hyper-parameters}\}\), we can use cross-validation.
Note: This comes from kernel methods!
And now we can make predictions:

\[
\boldsymbol{\mu}(\mathbf{x}^*) = \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}^*)^\top \boldsymbol{\alpha}
\]

and we can also estimate the covariance:

\[
\boldsymbol{\sigma}^2(\mathbf{x}^*) = \boldsymbol{k}(\mathbf{x}^*, \mathbf{x}^*) - \boldsymbol{k}_{\mathbf{X}}(\mathbf{x}^*)^\top\left(\mathbf{K_{XX}} + \sigma^2\mathbf{I}_N\right)^{-1}\boldsymbol{k}_{\mathbf{X}}(\mathbf{x}^*)
\]
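A minimal, self-contained NumPy sketch of these prediction equations, assuming an RBF kernel and synthetic 1D coordinates (the data and hyper-parameter values are purely illustrative):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.5):
    """RBF kernel matrix between two sets of coordinates (rows are points)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 10, size=(50, 1))                   # observed coordinates
Y_obs = np.sin(X_obs[:, 0]) + 0.1 * rng.normal(size=50)    # noisy observations
X_new = np.linspace(0, 10, 200)[:, None]                   # coordinates to interpolate

sigma2 = 0.1**2
K_XX = rbf_kernel(X_obs, X_obs)
A = K_XX + sigma2 * np.eye(len(X_obs))
alpha = np.linalg.solve(A, Y_obs)                          # alpha = (K_XX + s2 I)^-1 Y_obs

k_star = rbf_kernel(X_new, X_obs)                          # cross-kernel, shape (200, 50)
mu = k_star @ alpha                                        # predictive mean
var = rbf_kernel(X_new, X_new).diagonal() - np.einsum(
    "ij,ji->i", k_star, np.linalg.solve(A, k_star.T)
)                                                          # predictive variance (per point)
```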
Notation#
- \(\boldsymbol{k}(\mathbf{x}, \mathbf{y}): \mathcal{X}\times \mathcal{Y} \rightarrow \mathbb{R}\) - kernel function that takes in two vectors and returns a scalar value.
- \(\boldsymbol{k}_{\mathbf{X}}(\mathbf{x}_*): \mathcal{X} \rightarrow \mathbb{R}^{N}\) - cross-kernel vector between a new point, \(\mathbf{x}_*\), and the data, \(\mathbf{X}\).
- \(\mathbf{K_{XX}} \in \mathbb{R}^{N_{\mathbf{x}} \times N_\mathbf{x}}\) - kernel matrix (Gram matrix), the result of evaluating the kernel function on all pairs of points in \(\mathbf{X}\).
- \(\mathbf{K_{XY}} \in \mathbb{R}^{N_{\mathbf{x}} \times N_\mathbf{y}}\) - cross kernel matrix between two sets of points, \(\mathbf{X}\) and \(\mathbf{Y}\).
Kernel Functions#
Linear Kernel

\[
\boldsymbol{k}(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top\mathbf{y}
\]

Gaussian Kernel (aka Radial Basis Function)

\[
\boldsymbol{k}(\mathbf{x}, \mathbf{y}) = \exp\left(-\gamma\|\mathbf{x} - \mathbf{y}\|_2^2\right)
\]

Note: This is known as the Gaussian kernel if \(\gamma = \frac{1}{\sigma^2}\).

Laplacian Kernel

\[
\boldsymbol{k}(\mathbf{x}, \mathbf{y}) = \exp\left(-\gamma\|\mathbf{x} - \mathbf{y}\|_1\right)
\]

Polynomial Kernel

\[
\boldsymbol{k}(\mathbf{x}, \mathbf{y}) = \left(\gamma\,\mathbf{x}^\top\mathbf{y} + c\right)^{d}
\]
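For reference, a vectorized NumPy sketch of these four kernels; the conventions follow `sklearn.metrics.pairwise`, which also provides `linear_kernel`, `rbf_kernel`, `laplacian_kernel`, and `polynomial_kernel` if you prefer a library implementation:

```python
import numpy as np

def linear_kernel(A, B):
    # k(x, y) = x^T y for every pair of rows
    return A @ B.T

def gaussian_kernel(A, B, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||_2^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def laplacian_kernel(A, B, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||_1)
    l1 = np.abs(A[:, None, :] - B[None, :, :]).sum(-1)
    return np.exp(-gamma * l1)

def polynomial_kernel(A, B, gamma=1.0, coef0=1.0, degree=3):
    # k(x, y) = (gamma * x^T y + coef0)^degree
    return (gamma * (A @ B.T) + coef0) ** degree
```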
Scaling#
The bottleneck in the above methods is the inversion of the \(\mathbf{K_{XX}}\) matrix, which is of order \(\mathcal{O}(N^3)\).
We can assume that our kernel matrix \(\mathbf{K_{XX}}\) can be approximated via a low-rank matrix:

\[
\mathbf{K_{XX}} \approx \mathbf{K_{XZ}}\mathbf{K_{ZZ}}^{-1}\mathbf{K_{ZX}}
\]

where \(\mathbf{K_{XZ}} \in \mathbb{R}^{N \times r}\).

Similarly, we can also write:

\[
\boldsymbol{k}_{\mathbf{X}}(\mathbf{x}^*) \approx \mathbf{K_{XZ}}\mathbf{K_{ZZ}}^{-1}\boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}^*)
\]

So then we can rewrite our solution to be:

\[
\boldsymbol{\alpha} \approx \left(\mathbf{K_{XZ}}\mathbf{K_{ZZ}}^{-1}\mathbf{K_{ZX}} + \sigma^2\mathbf{I}_N\right)^{-1}\mathbf{Y}_{obs}
\]

Now to make predictions, we apply the matrix inversion (Woodbury) lemma so that only \(r \times r\) systems need to be solved. This is much cheaper: the dominant cost is \(\mathcal{O}(Nr^2)\) instead of \(\mathcal{O}(N^3)\).

Predictive Mean

\[
\boldsymbol{\mu}(\mathbf{x}^*) = \boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}^*)^\top\left(\mathbf{K_{ZX}}\mathbf{K_{XZ}} + \sigma^2\mathbf{K_{ZZ}}\right)^{-1}\mathbf{K_{ZX}}\mathbf{Y}_{obs}
\]

Predictive Covariance

\[
\boldsymbol{\sigma}^2(\mathbf{x}^*) = \sigma^2\,\boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}^*)^\top\left(\mathbf{K_{ZX}}\mathbf{K_{XZ}} + \sigma^2\mathbf{K_{ZZ}}\right)^{-1}\boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}^*)
\]
where:
\(\mathbf{K_{\mathbf{XX}}} \in \mathbb{R}^{N \times N}\)
\(\mathbf{K_{\mathbf{ZZ}}} \in \mathbb{R}^{r \times r}\)
\(\mathbf{K_{\mathbf{ZX}}} \in \mathbb{R}^{r \times N}\)
\(\mathbf{K_{\mathbf{XZ}}} \in \mathbb{R}^{N \times r}\)
\(\boldsymbol{k}_{\mathbf{Z}}(\mathbf{x}) \in \mathbb{R}^{r}\)
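A rough NumPy sketch of the Nyström-style predictor above, with the inducing points \(\mathbf{Z}\) chosen as a random subset of the observed coordinates (one common choice among several); the data and hyper-parameters are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(1)
N, r, sigma2 = 5000, 100, 0.1**2
X_obs = rng.uniform(0, 10, size=(N, 1))
Y_obs = np.sin(X_obs[:, 0]) + 0.1 * rng.normal(size=N)
X_new = np.linspace(0, 10, 200)[:, None]

Z = X_obs[rng.choice(N, r, replace=False)]            # inducing points (random subset)

K_XZ = rbf_kernel(X_obs, Z)                           # (N, r) -- never form the (N, N) matrix
K_ZZ = rbf_kernel(Z, Z)                               # (r, r)
k_Znew = rbf_kernel(X_new, Z)                         # (200, r)

# Sigma = K_ZX K_XZ + sigma^2 K_ZZ is only r x r, so it is cheap to factorize
Sigma = K_XZ.T @ K_XZ + sigma2 * K_ZZ + 1e-8 * np.eye(r)   # small jitter for stability
weights = np.linalg.solve(Sigma, K_XZ.T @ Y_obs)

mu = k_Znew @ weights                                  # predictive mean
var = sigma2 * np.einsum("ij,ji->i", k_Znew, np.linalg.solve(Sigma, k_Znew.T))
```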
Resources:
Less is More: Nyström Computational Regularization
Using Nyström to Speed Up Kernel Machines
Linear -> Nyström Approximation
Polynomial -> Nyström / TensorSketch
Gaussian -> Nyström / RFF
Potential Projects#
Kernel Methods - solve exactly with cross validation for the best parameters (see the sketch after this list)
Gaussian Processes - solve with maximum likelihood
Kernel Functions - explore different kernel functions and see how they work (e.g. linear, Gaussian, Laplacian, Haversine, Matern Family)
Sparse Methods (Kernel) - use kernel approximation schemes (e.g. Nystrom, RFF)
Sparse GP Methods - use inducing points to see how they work.
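As a starting point for the first project above, a hedged sketch using `sklearn`'s `KernelRidge` with cross-validated hyper-parameters; the synthetic coordinates and the grid of candidate values are purely illustrative:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                      # e.g. (lat, lon) coordinates
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.1 * rng.normal(size=200)

# Cross-validate the regularization (noise level) and the RBF kernel width
search = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid={"alpha": np.logspace(-4, 0, 5), "gamma": np.logspace(-2, 1, 5)},
    cv=5,
)
search.fit(X, y)
y_interp = search.best_estimator_.predict(X)               # evaluate anywhere to fill gaps
```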
Model#
We have a non-parametric data model: rather than a single global parametric form, the fit is built from many local contributions, and the number of effective parameters may exceed the number of data points.
\[
\mathbf{y}_i = \boldsymbol{f}(\mathbf{x}_i) + \boldsymbol{\epsilon}_i, \qquad i = 1, \ldots, N
\]

where:

- \(\boldsymbol{f}(\cdot)\) - the regression function
- \(\mathbf{x}_i\) - the sampling position, e.g. latitude, longitude, and/or time
- \(\mathbf{y}_i\) - the sample
- \(\boldsymbol{\epsilon}_i\) - zero-mean, iid noise
Kernel Methods#
Connections and Equivalences between the Nyström Method and Sparse Variational Gaussian Processes - Arxiv
Gaussian Processes for Machine Learning - PDF
Spatial Analysis Made Easy with Linear Regression and Kernels - PDF
RFFs - blog
An exact kernel framework for spatio-temporal dynamics - arxiv
Spatio-Temporal VGPs - arxiv
Space-Time Kernels - prezi
Falkon - python
Kernel methods & sparse methods for computer vision - Bach - prezi
Kernel methods through the roof: handling billions of points efficiently - arxiv
KRR Demo - sklearn
Theory
References
Gaussian Process Machine Learning and Kriging for Groundwater Salinity Interpolation - Cui et al (2021) - Paper
They use different kernels with different model outputs to try and interpolate. They found that more expressive kernels with combinations of inputs provided better models.
Learning
How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets - Lu et al (2015)
On the Benefits of Large Learning Rates for Kernel Methods - Beugnot et al (2022)
Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems - Nakkiran et al (2020)
Motivation
Demo for Scalable Kernel Features w/ Sklearn - Blog
Software Planning
GPFlow Sparse GP for Spatio-Temporal Data - Demo
SKIP Example - Demo
LOO GPyTorch - Code
LOO Sklearn - Code
RFF
Sampling RFF Efficiently - Blog | VFE GP - Blog
Scalable Computation
Kernel methods through the roof: handling billions of points efficiently - Meanti et al (2020) - arxiv | video
Efficient Hyperparameter Tuning for Large Scale Kernel Ridge Regression - Meanti et al (2022) - arxiv
Scalable Kernel Methods via Doubly Stochastic Gradients - Dai et al (2015)
Making Large-Scale Nyström Approximation Possible - Li et al (2010) | Blog
Large-Scale Nyström Kernel Matrix Approximation Using Randomized SVD - Li et al (2014)
Practical Algorithms for Latent Variable Models - Gundersen (Thesis) (2021) - thesis
Kernel Ridge Regression - Falkon | Regression | Hyperparam | Auto Hyperparam | Custom Kernel | Custom Losses
Kernel Interpolation - KeOps Demo (Numpy/PyTorch)
Kernel Solve - KeOps Demo (Numpy/PyTorch)
Block-Matrices Reductions - KeOps Demo (Numpy/PyTorch)
Spatio-Temporal Kernels
Kernel Regression for Image Processing and Reconstruction - Takeda et al (2007) - pdf
Composite Kernels for HSI classification - Gustau (2006) - Paper
Connections
Deep Gaussian Markov Random Fields - Siden & Lindsten (2020) - arxiv
Markovian Gaussian Processes
Spatio-Temporal Variational Gaussian Processes - Hamelijnick et al (2021) - arxiv (JAX)
Variational Gaussian Process State-Space Models - Frigola et al (2014) - arxiv
Periodic Activation Functions