Kernel Parameter Estimation
Motivation
- What is similarity?
- Why HSIC?
- The differences between HSIC variants
- The problems with high-dimensional data
Research Questions
Demo Notebook
See this notebook for a full breakdown of each research question, why it matters, and where the difficulties lie.
1. Which Scorer should we use?
We compare different "HSIC scorers" because they vary in two respects: whether they center the kernel matrix, and whether they normalize the score by the norms of the individual kernels.
Different Scorers
Notice: HSIC uses the centered kernels (e.g. K_xH) and no normalization.
Notice: KA (kernel alignment) uses the uncentered kernels with a normalization factor.
Notice: CKA (centered kernel alignment) uses the centered kernels with a normalization factor.
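As a concrete sketch, the three scorers can be written in a few lines of NumPy. This is a minimal reading of the descriptions above, not the repository's code; the function names and the 1/(n-1)² HSIC scaling are assumptions:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def rbf_kernel(X, sigma):
    # RBF kernel matrix with a single length scale sigma
    D2 = squareform(pdist(X, "sqeuclidean"))
    return np.exp(-D2 / (2 * sigma ** 2))

def hsic_scores(Kx, Ky):
    # HSIC: centered kernels, no normalization
    # KA:   uncentered kernels, normalized by Frobenius norms
    # CKA:  centered kernels, normalized by Frobenius norms
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Kxc, Kyc = H @ Kx @ H, H @ Ky @ H
    hsic = np.trace(Kxc @ Kyc) / (n - 1) ** 2
    ka = np.sum(Kx * Ky) / (np.linalg.norm(Kx) * np.linalg.norm(Ky))
    cka = np.sum(Kxc * Kyc) / (np.linalg.norm(Kxc) * np.linalg.norm(Kyc))
    return hsic, ka, cka
```

Note that `np.linalg.norm` on a matrix defaults to the Frobenius norm, so KA and CKA are both bounded by 1 via Cauchy-Schwarz.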
2. Which Estimator should we use?
Example Estimators
Notice: neither of these estimators takes the data size into account.
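For contrast, classical rules of thumb such as Scott's and Silverman's do scale with the sample size n. A hedged sketch of the common multivariate forms, applied per feature (exact constants vary across references, and the function names are mine):

```python
import numpy as np

def scott_sigma(X):
    # Scott's rule: sigma_d * n^(-1/(d+4)) per feature
    n, d = X.shape
    return X.std(axis=0, ddof=1) * n ** (-1.0 / (d + 4))

def silverman_sigma(X):
    # Silverman's rule: (4/(d+2))^(1/(d+4)) * sigma_d * n^(-1/(d+4)) per feature
    n, d = X.shape
    factor = (4.0 / (d + 2)) ** (1.0 / (d + 4))
    return factor * X.std(axis=0, ddof=1) * n ** (-1.0 / (d + 4))
```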
The heuristic is the median distance between the points of the domain. The full formula is:

\sigma = \sqrt{H_n / 2}

where H_n = \text{Med}\left\{ ||X_{n,i} - X_{n,j}||^2 \,|\, 1 \leq i < j \leq n \right\} and \text{Med} is the empirical median (the mean can be used as well). We can obtain this by:
- Calculating the squareform Euclidean distances between all points in our dataset
- Ordering them in increasing order
- Setting H_n to the central element if n(n-1)/2 is odd, or the mean of the two central elements if n(n-1)/2 is even
Note: some authors just use \sqrt{H_n}.
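The steps above can be sketched directly with SciPy's condensed distance vector (the function name is mine):

```python
import numpy as np
from scipy.spatial.distance import pdist

def sigma_median_heuristic(X, reducer=np.median):
    # condensed vector of squared Euclidean distances for all pairs i < j
    d2 = pdist(X, "sqeuclidean")
    Hn = reducer(d2)           # np.median gives H_n; pass np.mean for the variant
    return np.sqrt(Hn / 2.0)   # some authors use np.sqrt(Hn) instead
```

Conveniently, `np.median` of an even-length array already returns the mean of the two central elements, so the odd/even case split is handled for free.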
This distance measure was new to me: it is the median (or mean) of the distances from each point to its k-th nearest neighbour. Essentially, we proceed as with the median heuristic, except we take the k-th distance for each data point and then take the median of those.
- Calculate the squareform of the distance matrix
- Sort each row in ascending order
- Take the k-th column (the distance from each point to its k-th neighbour)
- Take the median or mean of this column
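A sketch of the k-th-neighbour version, following the four steps above (the function name and default k are mine):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def sigma_kth_neighbour(X, k=5, reducer=np.median):
    D = squareform(pdist(X, "euclidean"))  # full n x n distance matrix
    D_sorted = np.sort(D, axis=1)          # sort each row ascending; column 0 is the self-distance (0)
    kth = D_sorted[:, k]                   # distance from each point to its k-th nearest neighbour
    return reducer(kth)                    # median (or mean) over the dataset
```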
3. Should we use different length scales or the same?
Details
We can estimate one sigma per dataset (\sigma_X, \sigma_Y) or just one shared sigma (\sigma_{XY}).
We can also estimate one sigma per dataset per dimension (\sigma_{X_d}, \sigma_{Y_d}), which corresponds to an ARD kernel.
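A sketch of the per-dimension option: estimate one median-heuristic length scale per feature and plug the resulting vector into an ARD-style RBF kernel (both function names are mine):

```python
import numpy as np
from scipy.spatial.distance import pdist

def sigma_per_dimension(X, reducer=np.median):
    # one median-heuristic sigma per feature, i.e. sigma_{X_d}
    return np.array([np.sqrt(reducer(pdist(X[:, [d]], "sqeuclidean")) / 2.0)
                     for d in range(X.shape[1])])

def ard_rbf_kernel(X, sigmas):
    # ARD RBF kernel: each dimension scaled by its own length scale
    Xs = X / sigmas
    d2 = np.sum((Xs[:, None, :] - Xs[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2)
```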
4. Should we standardize our data?
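Standardization changes all pairwise distances, so it interacts with every sigma estimator above. A minimal sketch of the usual zero-mean, unit-variance transform (the eps guard for constant features is my addition):

```python
import numpy as np

def standardize(X, eps=1e-12):
    # zero mean, unit variance per feature
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)
```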
5. Summary of Parameters

| Parameter | Options |
|---|---|
| Standardize | Yes / No |
| Parameter Estimator | Mean, Median, Silverman, etc. |
| Center Kernel | Yes / No |
| Normalized Score | Yes / No |
| Kernel | RBF / ARD |
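The table above defines a small Cartesian product of options; a sketch of how the full parameter grid could be enumerated (the key names and option spellings are mine):

```python
from itertools import product

param_grid = {
    "standardize": [True, False],
    "estimator": ["mean", "median", "silverman"],
    "center_kernel": [True, False],
    "normalized": [True, False],
    "kernel": ["rbf", "ard"],
}

# one dict per configuration: 2 * 3 * 2 * 2 * 2 = 48 in total
configs = [dict(zip(param_grid, vals)) for vals in product(*param_grid.values())]
```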
Experiments
Walk-Throughs
These walk-throughs go step by step through how I implemented everything. They are mainly for code-review purposes, but they also contain some useful nuggets.
- 1.0 - Estimating Sigma
In this notebook, I show how one can estimate sigma using different heuristics in the literature.
- 2.0 - Estimating HSIC
I show how we can estimate HSIC using some of the main methods in the literature.
- 3.0 - Multivariate Distribution
I show how we can apply this to large multivariate data and create a large-scale parameter search.
- 4.1 - Best Parameters
This is part I, where I show some preliminary results for which methods work best for the Gaussian distribution.
- 4.2 - Best Parameters
This is part II, where I show some preliminary results for which methods work best for the T-Student distribution.
- 5.0 - Fitting Mutual Information
I show how the centered kernel alignment best approximates the Gaussian distribution.
Parameter Grid - 1D Data
Parameter Grid - nD Data
Demo Notebook
Mutual Information vs HSIC scores
Demo Notebook
Results
Take-Home Message I

The median distance heuristic seems fairly robust across different sample sizes and dimensions. Scott and Silverman should probably be avoided if you are not going to estimate the parameter per feature.
Take-Home Message II

The centered kernel alignment (CKA) method appears to be the most consistent when we compare the score against the mutual information of known distributions. HSIC is somewhat consistent, but not entirely so. The KA algorithm shows no consistency whatsoever; avoid this method for unsupervised problems.