Kernel Measures of Similarity¶
Notation
- \mathbf{X} \in \mathbb{R}^{N \times D_\mathbf{x}} are samples from a multidimensional r.v. \mathcal{X}
- \mathbf{Y} \in \mathbb{R}^{N \times D_\mathbf{y}} are samples from a multidimensional r.v. \mathcal{Y}
- K \in \mathbb{R}^{N \times N} is a kernel matrix.
- K_\mathbf{x} is a kernel matrix for the r.v. \mathcal{X}
- K_\mathbf{y} is a kernel matrix for the r.v. \mathcal{Y}
- K_\mathbf{xy} is the cross kernel matrix for the r.v.s \mathcal{X}, \mathcal{Y}
- \tilde{K} \in \mathbb{R}^{N \times N} is the centered kernel matrix.
Observations
- \mathbf{X},\mathbf{Y} can have different numbers of dimensions
- \mathbf{X},\mathbf{Y} must have the same number of samples (see the sketch below)
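To make the notation concrete, here is a minimal sketch (assuming numpy and scikit-learn are available; the RBF kernel and its default gamma are arbitrary choices):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import KernelCenterer

# Toy samples: X and Y share the same number of samples N,
# but may live in spaces of different dimension.
rng = np.random.RandomState(0)
N, Dx, Dy = 100, 3, 5
X = rng.randn(N, Dx)
Y = rng.randn(N, Dy)

K_x = rbf_kernel(X)                              # (N, N) kernel matrix for X
K_y = rbf_kernel(Y)                              # (N, N) kernel matrix for Y
K_x_tilde = KernelCenterer().fit_transform(K_x)  # centered kernel matrix

print(K_x.shape, K_y.shape, K_x_tilde.shape)     # all (100, 100)
```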
Feature Map¶
We have a function \varphi that maps \mathcal{X} into some feature space \mathcal{F}.
Function Class¶
Reproducing Kernel Hilbert Space \mathcal{H} with kernel k.
Evaluation functionals: for every f \in \mathcal{H},
f(x) = \langle f, k(x, \cdot) \rangle_\mathcal{H}
We can compute means via linearity:
\mathbb{E}_{x \sim P}\left[ f(x) \right] = \left\langle f, \mathbb{E}_{x \sim P}\left[ k(x, \cdot) \right] \right\rangle_\mathcal{H}
And empirically:
\frac{1}{N} \sum_{i=1}^{N} f(x_i) = \left\langle f, \frac{1}{N} \sum_{i=1}^{N} k(x_i, \cdot) \right\rangle_\mathcal{H}
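A small sketch of the empirical version (the RBF kernel and the evaluation points t are arbitrary choices):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(100, 2)     # samples x_i from P
t = rng.randn(5, 2)       # points at which to evaluate the mean function

# Empirical mean embedding evaluated at the points t:
# mu_hat(t) = (1/N) * sum_i k(x_i, t), i.e. the column mean of the kernel matrix.
mu_hat_at_t = rbf_kernel(X, t).mean(axis=0)
print(mu_hat_at_t.shape)  # (5,)
```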
Kernels¶
Kernels let us avoid computing \varphi(X) explicitly: we only need a function k(x, x') that evaluates the dot product \langle \varphi(x), \varphi(x') \rangle_\mathcal{F} directly.
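As a concrete illustration (a sketch using the homogeneous degree-2 polynomial kernel, chosen because its feature map can be written explicitly):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the homogeneous degree-2 polynomial kernel:
    # phi(x) = vec(x x^T), so <phi(x), phi(z)> = (x^T z)^2.
    return np.outer(x, x).ravel()

rng = np.random.RandomState(0)
x, z = rng.randn(4), rng.randn(4)

explicit = phi(x) @ phi(z)   # dot product computed in feature space
implicit = (x @ z) ** 2      # kernel evaluation, no feature map needed
print(np.allclose(explicit, implicit))  # True
```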
Reproducing Kernel Hilbert Space Notation¶
Reproducing Property: for every f \in \mathcal{H},
f(x) = \langle f, k(x, \cdot) \rangle_\mathcal{H}
Equivalence between \varphi(x) and k(x,\cdot): the canonical feature map is \varphi(x) = k(x, \cdot), so that
k(x, x') = \langle \varphi(x), \varphi(x') \rangle_\mathcal{H}
Probabilities in Feature Space: The Mean Trick¶
Mean Embedding¶
Maximum Mean Discrepancy (MMD)¶
Hilbert-Schmidt Independence Criterion (HSIC)¶
Given \mathbb{P}, a Borel probability measure on \mathcal{X}, we can define its mean embedding \mu_P \in \mathcal{F}:
\mu_P = \mathbb{E}_{x \sim P}\left[ \varphi(x) \right] = \mathbb{E}_{x \sim P}\left[ k(x, \cdot) \right]
Given a positive definite kernel k(x,x'), we can define the expectation of the cross kernel as:
\mathbb{E}_{x, y}\left[ k(x, y) \right] = \langle \mu_P, \mu_Q \rangle_\mathcal{F}
for x \sim P and y \sim Q. We can use the mean trick to estimate such quantities directly from samples, as in the sketch below.
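A minimal sketch of the (biased) empirical MMD^2 estimator via the mean trick, assuming RBF kernels and that \mathbf{X} and \mathbf{Y} live in the same space so the cross kernel K_\mathbf{xy} is defined:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd2_biased(X, Y, gamma=1.0):
    """Biased empirical estimate of MMD^2 = ||mu_P - mu_Q||^2
    using the mean trick: averages of K_x, K_y and the cross kernel K_xy."""
    K_x = rbf_kernel(X, X, gamma=gamma)
    K_y = rbf_kernel(Y, Y, gamma=gamma)
    K_xy = rbf_kernel(X, Y, gamma=gamma)
    return K_x.mean() + K_y.mean() - 2 * K_xy.mean()

rng = np.random.RandomState(0)
X = rng.randn(200, 2)            # samples from P
Y = rng.randn(200, 2) + 1.0      # samples from Q (shifted mean)
print(mmd2_biased(X, Y))                   # noticeably > 0
print(mmd2_biased(X, rng.randn(200, 2)))   # close to 0
```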
Covariance Measures¶
Uncentered Kernel¶
\text{cov}(\mathbf{X}, \mathbf{Y}) = ||K_{\mathbf{xy}}||_\mathcal{F} = \langle K_\mathbf{x}, K_\mathbf{y} \rangle_\mathcal{F}
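A small numerical check (a sketch assuming linear kernels, where \langle K_\mathbf{x}, K_\mathbf{y} \rangle_\mathcal{F} reduces to the squared Frobenius norm of the uncentered cross term \mathbf{X}^\top \mathbf{Y}):

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
Y = rng.randn(50, 4)

K_x = linear_kernel(X, X)   # X X^T
K_y = linear_kernel(Y, Y)   # Y Y^T

# Frobenius inner product <K_x, K_y>_F; equal to tr(K_x K_y).
cov_xy = np.sum(K_x * K_y)
print(np.allclose(cov_xy, np.trace(K_x @ K_y)))                   # True
# For linear kernels this equals the squared Frobenius norm of X^T Y.
print(np.allclose(cov_xy, np.linalg.norm(X.T @ Y, 'fro') ** 2))   # True
```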
Centered Kernel¶
Hilbert-Schmidt Independence Criterion (HSIC)¶
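A minimal sketch of the biased empirical HSIC estimator, assuming the common form \text{HSIC} = \langle \tilde{K}_\mathbf{x}, \tilde{K}_\mathbf{y} \rangle_\mathcal{F} / (n-1)^2 with \tilde{K} = HKH, H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top, and RBF kernels as an arbitrary choice:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def hsic_biased(X, Y, gamma=1.0):
    """Biased empirical HSIC: <K_x~, K_y~>_F / (n-1)^2,
    where K~ = H K H and H = I - 11^T / n is the centering matrix."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K_x = rbf_kernel(X, X, gamma=gamma)
    K_y = rbf_kernel(Y, Y, gamma=gamma)
    return np.sum((H @ K_x @ H) * K_y) / (n - 1) ** 2

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
print(hsic_biased(X, X))                   # large: X depends on itself
print(hsic_biased(X, rng.randn(200, 2)))   # near 0: independent samples
```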
Maximum Mean Discrepancy (MMD)¶
Kernel Matrix Inversion¶
Sherman-Morrison-Woodbury¶
Matrix Sketch
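A minimal numpy sketch of the Woodbury identity applied to a regularized low-rank kernel matrix; the factor Z, the noise level sigma2, and the sizes are illustrative placeholders (Z could come, for example, from a matrix sketch or a kernel approximation):

```python
import numpy as np

rng = np.random.RandomState(0)
n, m, sigma2 = 500, 20, 0.1
Z = rng.randn(n, m)               # low-rank factor (e.g. sketch / approximate features)
A = sigma2 * np.eye(n) + Z @ Z.T  # n x n matrix we want to invert

# Woodbury identity: (sigma2*I + Z Z^T)^{-1}
#   = (1/sigma2) * (I - Z (sigma2*I_m + Z^T Z)^{-1} Z^T)
# so only an m x m system has to be solved.
small = np.linalg.inv(sigma2 * np.eye(m) + Z.T @ Z)
A_inv_woodbury = (np.eye(n) - Z @ small @ Z.T) / sigma2

print(np.allclose(A_inv_woodbury, np.linalg.inv(A)))  # True
```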
Kernel Approximation¶
Random Fourier Features¶
Nyström Approximation¶
According to ... the Nyström approximation works better when you want features that are data dependent. The RFF method assumes a fixed basis that is independent of the data; it merely projects the data onto this data-independent basis. The Nyström approximation instead forms its basis from the data itself.
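A quick comparison sketch using scikit-learn's RBFSampler (RFF) and Nystroem transformers; the data, gamma, and the number of components are arbitrary, and the relative errors depend on them:

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler, Nystroem
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(500, 2)
gamma = 0.1

K_exact = rbf_kernel(X, gamma=gamma)

# Random Fourier Features: data-independent random basis.
Z_rff = RBFSampler(gamma=gamma, n_components=200, random_state=0).fit_transform(X)
# Nystroem: basis built from a subsample of the data itself.
Z_nys = Nystroem(gamma=gamma, n_components=200, random_state=0).fit_transform(X)

err_rff = np.linalg.norm(K_exact - Z_rff @ Z_rff.T, 'fro')
err_nys = np.linalg.norm(K_exact - Z_nys @ Z_nys.T, 'fro')
print(err_rff, err_nys)  # Nystroem is typically the more accurate one here
```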
Resources
- A Practical Guide to Randomized Matrix Computations with MATLAB Implementations - Shusen Wang (2015) - arXiv
Structured Kernel Interpolation¶
Correlation Measures¶
Uncentered Kernel¶
Kernel Alignment (KA)¶
In the Literature
- Kernel Alignment
Centered Kernel¶
Centered Kernel Alignment (cKA)¶
\rho(\mathbf{X}, \mathbf{Y}) = \frac{\langle \tilde{K}_\mathbf{x}, \tilde{K}_\mathbf{y} \rangle_\mathcal{F}}{||\tilde{K}_\mathbf{x}||_\mathcal{F} \, ||\tilde{K}_\mathbf{y}||_\mathcal{F}}
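A minimal sketch of this estimator (assuming RBF kernels and centering with H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top; the kernel choice and gamma are arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def cka(X, Y, gamma=1.0):
    """Centered kernel alignment between samples X and Y (same N, any D)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kx = H @ rbf_kernel(X, X, gamma=gamma) @ H   # centered kernel matrices
    Ky = H @ rbf_kernel(Y, Y, gamma=gamma) @ H
    return np.sum(Kx * Ky) / (np.linalg.norm(Kx, 'fro') * np.linalg.norm(Ky, 'fro'))

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
print(cka(X, X))                   # 1.0: identical representations
print(cka(X, rng.randn(200, 5)))   # small: unrelated representations
```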
In the Literature
- Centered Kernel Alignment
Supplementary¶
Ideas¶
What happens when?
- HS Norm of Noisy Matrix
- HS Norm of PCA components