Optimal Interpolation aims at finding the Best Linear Unbiased Estimator (BLUE).
Problem Setting¶
$$
\mathbf{x} = [\text{lon}_1, \ldots, \text{lon}_{D_{lon}}, \text{lat}_1, \ldots, \text{lat}_{D_{lat}}, \text{time}_1, \ldots, \text{time}_{D_{time}}] \in \mathbb{R}^{D_{lat} \times D_{lon} \times D_{time}}
$$

For example, if we have a spatial lat-lon grid of 30×30 points and 30 time steps, then the state is 30×30×30, i.e. a 27,000-dimensional vector!
Optimal Interpolation (OI) is widely regarded as the gold standard for gap-filling SSH fields with missing data, a situation that is very common for satellite sensors. The standard OI method is:
$$
\begin{aligned}
\tilde{\boldsymbol{\mu}}_{x_t} &= \mathbf{x}_t + \mathbf{K}(\mathbf{y}_t - \mathbf{H}\mathbf{x}_t) \\
\tilde{\boldsymbol{\Sigma}}_{x_t} &= (\mathbf{I} - \mathbf{K}\mathbf{H})\boldsymbol{\Sigma}_x \\
\mathbf{K} &= \mathbf{P}\mathbf{H}^\top(\mathbf{H}\mathbf{P}\mathbf{H}^\top + \mathbf{R})^{-1}
\end{aligned}
$$

The matrix $\mathbf{K}$ is known as the Kalman gain. The optimal $\mathbf{K}$ is given by:
$$
\mathbf{K} = \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yy}^{-1}
$$

where $\boldsymbol{\Sigma}_{yy}$ is the covariance of the observations and $\boldsymbol{\Sigma}_{xy}$ is the cross-covariance between the state and the observations.
Notice how this assumes that interpolation can be done via a linear operator, $\mathbf{H}$. The Kalman gain, $\mathbf{K}$, represents the best interpolation under this linear model. However, this is often insufficient in scenarios where the data do not exhibit linear dynamics.
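To make the update concrete, here is a minimal NumPy sketch of the OI/Kalman analysis step under the linear-Gaussian assumptions above; the grid size, the subsampling observation operator $\mathbf{H}$, and the covariance choices are illustrative placeholders, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N_obs = 50, 20                       # state size and number of observations (illustrative)
x_prior = rng.normal(size=D)            # prior (background) state x_t
P = np.eye(D)                           # prior covariance (placeholder choice)
R = 0.1 * np.eye(N_obs)                 # observation noise covariance

# Linear observation operator H: here a simple subsampling of the state
obs_idx = rng.choice(D, size=N_obs, replace=False)
H = np.zeros((N_obs, D))
H[np.arange(N_obs), obs_idx] = 1.0

y = H @ x_prior + rng.normal(scale=0.1, size=N_obs)   # synthetic observations

# Kalman gain: K = P H^T (H P H^T + R)^{-1}
K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)

# OI / BLUE update of the mean and covariance
x_post = x_prior + K @ (y - H @ x_prior)        # posterior mean
P_post = (np.eye(D) - K @ H) @ P                # posterior covariance
```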
Data-Driven OI¶
So in practice, it's often not possible to specify the true covariances (or the operator $\mathbf{H}$) analytically; instead, we estimate them from data.
Notation¶
| Symbol | Meaning |
|---|---|
| $x \in \mathbb{R}$ | scalar |
| $\mathbf{x} \in \mathbb{R}^{D}$ | vector |
| $\mathbf{X} \in \mathbb{R}^{N \times D}$ | matrix |
| $\mathbf{K}_{XX} := \mathbf{K} \in \mathbb{R}^{N \times N}$ | the kernel (Gram) matrix |
| $k(\mathbf{X}, \mathbf{x}) = \mathbf{k}_X(\mathbf{x}) \in \mathbb{R}^{N}$ | the cross-kernel vector between some data, $\mathbf{X} \in \mathbb{R}^{N \times D}$, and some new vector, $\mathbf{x}$ |
Setting¶
$$
\begin{aligned}
\mathbf{x}(t) &= \mathcal{M}(\mathbf{x}(t-1)) + \eta(t) \\
\mathbf{y}(t) &= \mathcal{H}(\mathbf{x}(t)) + \epsilon(t)
\end{aligned}
$$

Discrete Setting

$$
\begin{aligned}
\mathbf{x}_k &= \mathbf{x}_{k-1} + \epsilon_x \\
\mathbf{y}_k &= \mathbf{H}\mathbf{x}_k + \epsilon_y
\end{aligned}
$$

where:

- $\epsilon_x \sim \mathcal{N}(0, \mathbf{P})$
- $\epsilon_y \sim \mathcal{N}(0, \mathbf{R})$
O.I. Equations¶
$$
\begin{aligned}
\mathbf{K} &= \mathbf{P}\mathbf{H}^\top(\mathbf{H}\mathbf{P}\mathbf{H}^\top + \mathbf{R})^{-1} \\
\tilde{\boldsymbol{\mu}}_{x_t} &= \mathbf{x}_t + \mathbf{K}(\mathbf{y}_t - \mathbf{H}\mathbf{x}_t) \\
\tilde{\boldsymbol{\Sigma}}_{x_t} &= (\mathbf{I} - \mathbf{K}\mathbf{H})\boldsymbol{\Sigma}_x
\end{aligned}
$$

where:
- $\mathbf{y}$ - observations
- $\mathbf{x}$ - prior (background) state
- $\mathbf{K}$ - Kalman gain
- $\mathbf{B}$ - noise covariance matrix of the prior/background state (denoted $\mathbf{P}$ in the equations above)
- $\mathbf{R}$ - noise covariance matrix of the observations
Cost Function¶
Lorenc (1986) showed that OI is closely related to the 3D-Var variational data assimilation cost function:
$$
\mathcal{L}(\boldsymbol{\theta}) = (\mathbf{x} - \mathbf{x}_b)^\top\mathbf{B}^{-1}(\mathbf{x} - \mathbf{x}_b) + (\mathbf{y} - \mathbf{H}\mathbf{x})^\top\mathbf{R}^{-1}(\mathbf{y} - \mathbf{H}\mathbf{x})
$$

where $\boldsymbol{\theta} = \{\mathbf{B}, \mathbf{R}\}$.
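As a quick illustration, here is a minimal NumPy sketch of this cost function; the dimensions, the operator $\mathbf{H}$, and the covariance choices below are placeholder assumptions.

```python
import numpy as np

def threedvar_cost(x, x_b, y, H, B, R):
    """3D-Var cost: background misfit plus observation misfit."""
    dx = x - x_b
    dy = y - H @ x
    return dx @ np.linalg.solve(B, dx) + dy @ np.linalg.solve(R, dy)

# Tiny usage example with placeholder values
D, N = 5, 3
x_b = np.zeros(D)                 # background state
B = np.eye(D)                     # background error covariance
R = 0.1 * np.eye(N)               # observation error covariance
H = np.eye(N, D)                  # simple linear observation operator
y = np.ones(N)                    # observations
print(threedvar_cost(np.zeros(D), x_b, y, H, B, R))
```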
In Practice¶
It doesn't make sense to assume an affine linear transformation if our model cannot actually be described by one. Instead, we can try to estimate the mapping directly from data. Let's take some observed coordinate/value pairs, $\{\mathbf{X}_{obs}, \mathbf{Y}_{obs}\} = \{\mathbf{x}_{obs}, \mathbf{y}_{obs}\}_{i=1}^{N}$. We will learn a model that best fits these observations, and then use it as an interpolation method to fill in the gaps.
Historical Literature:
- Bretherton et al. 1976
- McIntosh, 1990
- Le Traon et al. 1998
Modern Literature:
- Kernel Methods
- Kernel Interpolation
We can write an exact equation for the best least squares linear estimator, f(x):
$$
f(\mathbf{x}_*) = \mu_x + \sum_{i=1}^{D_x}\sum_{j=1}^{D_y} A_{ij}^{-1} C_{x_* j}\,(y_i - \mu_{x_j})
$$

where $A_{ij}$ is the covariance between the observation locations:

$$
A_{ij} = \langle \mathbf{x}_{obs}, \mathbf{x}_{obs}' \rangle + \langle \epsilon, \epsilon' \rangle
$$

and $C_{x_* j}$ is the cross-covariance between the locations to be estimated and the observation locations:

$$
C_{x_* j} = \langle \mathbf{x}_*, \mathbf{x}_{obs} \rangle
$$
We can write the full equation (in vector form) as follows:
$$
f(\mathbf{x}_*) = \mathbf{k}_X(\mathbf{x}_*)(\mathbf{K}_{XX} + \sigma^2\mathbf{I})^{-1}\mathbf{Y}_{obs}
$$

where:

- $\mathbf{K}_{XX}$ - the kernel matrix over the coordinates where we have observations, $\mathbf{Y}_{obs} \in \mathbb{R}^{N \times D}$
- $\mathbf{k}_X(\cdot)$ - the cross-kernel between the new coordinates, $\mathbf{x}_* \in \mathbb{R}^{D}$, and the observed coordinates, $\mathbf{X} \in \mathbb{R}^{N \times D}$
We have a kernel function which is the correlation between the observation locations, $\mathbf{x}$:

$$
K_{ij} = \langle \mathbf{x}_{obs}, \mathbf{x}_{obs}' \rangle
$$

where $\mathbf{X}_{obs} \in \mathbb{R}^{N \times D}$.
If we only look at the part that depends upon the observations, $\mathbf{Y}_{obs}$, we can call it $\boldsymbol{\alpha}$:

$$
\boldsymbol{\alpha} = (\mathbf{K}_{XX} + \sigma^2\mathbf{I})^{-1}\mathbf{Y}_{obs}
$$

where $\boldsymbol{\alpha} \in \mathbb{R}^{N \times D}$. We can solve for $\boldsymbol{\alpha}$ exactly. To find the best remaining free parameters, $\boldsymbol{\theta} = \{\sigma^2, \gamma\}$ (the noise level and the kernel hyperparameters), we can use cross-validation.
Note: This comes from kernel methods!
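In practice, this cross-validation can be done conveniently with scikit-learn's `KernelRidge`, as sketched below; the synthetic data, the RBF kernel choice, and the parameter grid are assumptions for illustration only.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Toy 1D "coordinates" and noisy observations (placeholder data)
X_obs = rng.uniform(0, 10, size=(100, 1))
y_obs = np.sin(X_obs[:, 0]) + 0.1 * rng.normal(size=100)

# Cross-validate the regularization term (alpha ~ sigma^2) and the RBF length-scale (gamma)
param_grid = {"alpha": np.logspace(-3, 1, 5), "gamma": np.logspace(-2, 1, 4)}
model = GridSearchCV(KernelRidge(kernel="rbf"), param_grid, cv=5)
model.fit(X_obs, y_obs)

print(model.best_params_)    # selected {sigma^2, gamma}
```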
And now we can make predictions:

$$
m(\mathbf{x}_*) = \mathbf{k}_X(\mathbf{x}_*)\boldsymbol{\alpha}
$$

and we can also estimate the covariance:

$$
\sigma^2(\mathbf{x}_*, \mathbf{x}_*') = \sigma^2 + k(\mathbf{x}_*, \mathbf{x}_*') - \mathbf{k}_X(\mathbf{x}_*)(\mathbf{K}_{XX} + \sigma^2\mathbf{I})^{-1}\mathbf{k}_X(\mathbf{x}_*')^\top
$$
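Putting the pieces together, here is a minimal NumPy sketch of the exact solution with an RBF kernel; the synthetic data and the hyperparameter values are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between two sets of coordinates."""
    return np.exp(-gamma * cdist(A, B, metric="sqeuclidean"))

rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 10, size=(50, 1))                   # observed coordinates
Y_obs = np.sin(X_obs) + 0.1 * rng.normal(size=(50, 1))     # observed values
x_star = np.linspace(0, 10, 200)[:, None]                  # prediction coordinates

sigma2, gamma = 0.01, 1.0                                  # placeholder hyperparameters
K_XX = rbf_kernel(X_obs, X_obs, gamma)
k_sX = rbf_kernel(x_star, X_obs, gamma)                    # cross-kernel k_X(x_*)

# alpha = (K_XX + sigma^2 I)^{-1} Y_obs   (solved exactly)
alpha = np.linalg.solve(K_XX + sigma2 * np.eye(len(X_obs)), Y_obs)

# Predictive mean: m(x_*) = k_X(x_*) alpha
mean = k_sX @ alpha

# Predictive covariance: sigma^2 I + K_** - k_X(x_*)(K_XX + sigma^2 I)^{-1} k_X(x_*')^T
K_ss = rbf_kernel(x_star, x_star, gamma)
cov = sigma2 * np.eye(len(x_star)) + K_ss - k_sX @ np.linalg.solve(
    K_XX + sigma2 * np.eye(len(X_obs)), k_sX.T
)
```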
Notation¶
- $k(\mathbf{x}, \mathbf{y}): \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$ - kernel function that takes in two vectors and returns a scalar value
- $\mathbf{k}_X(\mathbf{x}_*): \mathcal{X} \rightarrow \mathbb{R}^{D_x}$ - cross-kernel vector between a new point, $\mathbf{x}_*$, and the data, $\mathbf{X}$
- $\mathbf{K}_{XX} \in \mathbb{R}^{D_x \times D_x}$ - kernel matrix (Gram matrix), the result of evaluating the kernel on all pairs of points in $\mathbf{X}$
- $\mathbf{K}_{XY} \in \mathbb{R}^{D_x \times D_y}$ - cross-kernel matrix
Kernel Functions¶
Linear Kernel

$$
k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top\mathbf{y}
$$

Gaussian Kernel (aka Radial Basis Function; the limit of the Matérn family as $\nu \rightarrow \infty$)

$$
k(\mathbf{x}, \mathbf{y}) = \exp(-\gamma\,||\mathbf{x} - \mathbf{y}||_2^2)
$$

Note: this is known as the Gaussian kernel when $\gamma = \frac{1}{\sigma^2}$.

Laplacian Kernel

$$
k(\mathbf{x}, \mathbf{y}) = \exp(-\gamma\,||\mathbf{x} - \mathbf{y}||_1)
$$

Polynomial Kernel

$$
k(\mathbf{x}, \mathbf{y}) = (\gamma\,\mathbf{x}^\top\mathbf{y} + c_0)^d
$$
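A minimal NumPy sketch of these kernel functions for a pair of coordinate vectors; the parameter values in the usage lines are arbitrary placeholders.

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def gaussian_kernel(x, y, gamma=1.0):
    # exp(-gamma * ||x - y||_2^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

def laplacian_kernel(x, y, gamma=1.0):
    # exp(-gamma * ||x - y||_1)
    return np.exp(-gamma * np.sum(np.abs(x - y)))

def polynomial_kernel(x, y, gamma=1.0, c0=1.0, d=3):
    return (gamma * (x @ y) + c0) ** d

# Example evaluation on two coordinate vectors
x, y = np.array([1.0, 2.0]), np.array([0.5, 1.5])
print(gaussian_kernel(x, y, gamma=0.1), polynomial_kernel(x, y, d=2))
```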
Scaling¶
The bottleneck in the above methods is the inversion of the $\mathbf{K}_{XX}$ matrix, which is of order $\mathcal{O}(N^3)$.
We can assume that our kernel matrix $\mathbf{K}_{XX}$ can be approximated by a low-rank matrix:

$$
\mathbf{K}_{XX} \approx \mathbf{K}_{XZ}\mathbf{K}_{XZ}^\top
$$

where $\mathbf{K}_{XZ} \in \mathbb{R}^{N \times r}$.
Similarly, we can also write:

$$
\mathbf{K}_{XX} \approx \mathbf{K}_{XZ}\mathbf{K}_{ZZ}^{-1}\mathbf{K}_{XZ}^\top
$$

So we can rewrite our solution as:

$$
\boldsymbol{\alpha} = (\mathbf{K}_{XZ}^\top\mathbf{K}_{XZ} + \sigma^2\mathbf{K}_{ZZ})^\dagger\mathbf{K}_{XZ}^\top\mathbf{Y}_{obs}
$$

Now, to make predictions, we can simply write:

$$
m(\mathbf{x}_*) = \mathbf{k}_Z(\mathbf{x}_*)\boldsymbol{\alpha}
$$

This is much cheaper:

$$
\mathcal{O}(Nr^2 + r^3) < \mathcal{O}(N^3)
$$

Predictive Mean

$$
m(\mathbf{x}_*) = \mathbf{k}_Z(\mathbf{x}_*)\boldsymbol{\alpha}
$$

Predictive Covariance

$$
\sigma^2(\mathbf{x}_*, \mathbf{x}_*') = k(\mathbf{x}_*, \mathbf{x}_*') - \mathbf{k}_Z(\mathbf{x}_*)^\top\mathbf{K}_{ZZ}^{-1}\mathbf{k}_Z(\mathbf{x}_*') + \mathbf{k}_Z(\mathbf{x}_*)^\top(\mathbf{K}_{ZZ} + \sigma^{-2}\mathbf{K}_{ZX}\mathbf{K}_{XZ})^{-1}\mathbf{k}_Z(\mathbf{x}_*')
$$

where (a NumPy sketch follows this list):
- $\mathbf{K}_{XX} \in \mathbb{R}^{N \times N}$
- $\mathbf{K}_{ZZ} \in \mathbb{R}^{r \times r}$
- $\mathbf{K}_{ZX} \in \mathbb{R}^{r \times N}$
- $\mathbf{K}_{XZ} \in \mathbb{R}^{N \times r}$
- $\mathbf{k}_Z(\mathbf{x}) \in \mathbb{R}^{r}$
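A minimal NumPy sketch of this Nyström-style approximation, using $r$ inducing points $\mathbf{Z}$ randomly selected from the observed coordinates; the data, kernel, and hyperparameter values are placeholder assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(A, B, gamma=1.0):
    return np.exp(-gamma * cdist(A, B, metric="sqeuclidean"))

rng = np.random.default_rng(0)
N, r = 2000, 50                                       # N observations, r inducing points
X_obs = rng.uniform(0, 10, size=(N, 1))
Y_obs = np.sin(X_obs) + 0.1 * rng.normal(size=(N, 1))
Z = X_obs[rng.choice(N, size=r, replace=False)]       # inducing coordinates

sigma2, gamma = 0.01, 1.0
K_XZ = rbf_kernel(X_obs, Z, gamma)                    # N x r
K_ZZ = rbf_kernel(Z, Z, gamma)                        # r x r

# alpha = (K_XZ^T K_XZ + sigma^2 K_ZZ)^+ K_XZ^T Y_obs  -- only r x r systems appear
alpha = np.linalg.pinv(K_XZ.T @ K_XZ + sigma2 * K_ZZ) @ (K_XZ.T @ Y_obs)

# Predictions: m(x_*) = k_Z(x_*) alpha
x_star = np.linspace(0, 10, 200)[:, None]
mean = rbf_kernel(x_star, Z, gamma) @ alpha
```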
Resources:
- Less is More: Nyström Computational Regularization
- Using Nyström to Speed Up Kernel Machines
- Linear -> Nyström Approximation
- Polynomial -> Nyström / TensorSketch
- Gaussian -> Nyström / RFF
Potential Projects¶
- Kernel Methods - solve exactly with cross validation for the best parameters
- Gaussian Processes - solve with maximum likelihood
- Kernel Functions - explore different kernel functions and see how they work (e.g. linear, Gaussian, Laplacian, Haversine, Matern Family)
- Sparse Methods (Kernel) - use kernel approximation schemes (e.g. Nystrom, RFF)
- Sparse GP Methods - use inducing points to see how they work.
Model¶
We use a non-parametric data model: we fit a model built from many local fits, where the number of parameters may exceed the number of data points. The observation model is

$$
y_i = f(\mathbf{x}_i) + \epsilon_i, \qquad i = 1, 2, \ldots, N
$$

where (see the sketch after this list):

- $f(\cdot)$ - the regression function
- $\mathbf{x}_i$ - the sampling position, e.g. latitude, longitude, and/or time
- $y_i$ - the sample
- $\epsilon_i$ - zero-mean, iid noise
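As a small illustration of this observation model, here is a sketch that simulates samples $y_i = f(\mathbf{x}_i) + \epsilon_i$ at latitude/longitude positions; the function $f$, the domain, and the noise level are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sampling positions x_i: (lat, lon) pairs over a placeholder domain
X = rng.uniform(low=[-10.0, 30.0], high=[10.0, 50.0], size=(200, 2))

# Unknown regression function f (placeholder) and zero-mean iid noise eps_i
f = lambda x: np.sin(0.5 * x[:, 0]) * np.cos(0.3 * x[:, 1])
eps = 0.05 * rng.normal(size=len(X))

y = f(X) + eps    # samples y_i = f(x_i) + eps_i
```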
Kernel Methods¶
- Connections and Equivalences between the Nyström Method and Sparse Variational Gaussian Processes - Arxiv
- Gaussian Processes for Machine Learning - PDF
- PyKrige
- Spatial Analysis Made Easy with Linear Regression and Kernels - PDF
- RFFs - blog
- GP Regression - blog | blog
- An exact kernel framework for spatio-temporal dynamics - arxiv
- 4DVarNet - Broadcast | Loops | Batch
- Spatio-Temporal Variational Gaussian Processes - arxiv
- Space-Time Kernels - prezi
- Falkon - python
- Kernel methods & sparse methods for computer vision - bach - prezi
- Kernel methods through the roof: handling billions of points efficiently - arxiv
- KRR Demo - sklearn
Theory
- Machine Learning with Kernel Methods - Slides
- Kernel Ridge Regression Course - Class
References
- Gaussian Process Machine Learning and Kriging for Groundwater Salinity Interpolation - Cui et al (2021) - Paper
They use different kernels with different model outputs for interpolation, and found that more expressive kernels combining multiple inputs provided better models.
Learning
- How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets - Lu et al (2015)
- On the Benefits of Large Learning Rates for Kernel Methods - Beugnot et al (2022)
- Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems - Nakkiran et al (2020)
Motivation
- Demo for Scalable Kernel Features w/ Sklearn - Blog
Software Planning
RFF
Scalable Computation
- Kernel methods through the roof: handling billions of points efficiently - Meanti et al (2020) - arxiv | video
- Efficient Hyperparameter Tuning for Large Scale Kernel Ridge Regression - Meanti et al (2022) - arxiv
- Scalable Kernel Methods via Doubly Stochastic Gradients - Dai et al (2015)
- Making Large-Scale Nyström Approximation Possible - Li et al (2010) | Blog
- Large-Scale Nyström Kernel Matrix Approximation Using Randomized SVD - Li et al (2014)
- Practical Algorithms for Latent Variable Models - Gundersen (Thesis) (2021) - thesis
Spatio-Temporal Kernels
- Kernel Regression for Image Processing and Reconstruction - Takeda et al (2007) - pdf
- Composite Kernels for HSI classification - Gustau (2006) - Paper
Connections
- Deep Gaussian Markov Random Fields - Siden & Lindsten (2020) - arxiv
Markovian Gaussian Processes
- Spatio-Temporal Variational Gaussian Processes - Hamelijnick et al (2021) - arxiv (JAX)
- Variational Gaussian Process State-Space Models - Frigola et al (2014) - arxiv
Periodic Activation Functions
- NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis - Mildenhall et al (2020)
- Implicit Neural Representations with Periodic Activation Functions - Sitzmann et al (2020) - arxiv | page | eccv vid | blog