Kernel Parameter Estimation
Motivation
- What is similarity?
- Why HSIC?
- The differences between HSIC variants
- The problems with high-dimensional data
Research Questions
Demo Notebook
See this notebook for a full breakdown of each research question, why it matters, and where the difficulties lie.
1. Which Scorer should we use?
We compare different "HSIC scorers" because they vary in two respects: whether they center the kernel matrix, and whether they normalize the score by the norms of the individual kernels.
Different Scorers
Notice: HSIC uses the centered kernels (e.g. K_xH) and no normalization.
Notice: KA (kernel alignment) uses the uncentered kernels with a normalization factor.
Notice: CKA (centered kernel alignment) uses the centered kernels with a normalization factor.
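As a concrete sketch, the three scorers can be written in a few lines of NumPy. This is a minimal reading of the descriptions above, not the repository's code; the function names and the 1/(n-1)² HSIC scaling are assumptions:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def rbf_kernel(X, sigma):
    # RBF kernel matrix with a single length scale sigma
    D2 = squareform(pdist(X, "sqeuclidean"))
    return np.exp(-D2 / (2 * sigma ** 2))

def hsic_scores(Kx, Ky):
    # HSIC: centered kernels, no normalization
    # KA:   uncentered kernels, normalized by Frobenius norms
    # CKA:  centered kernels, normalized by Frobenius norms
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Kxc, Kyc = H @ Kx @ H, H @ Ky @ H
    hsic = np.trace(Kxc @ Kyc) / (n - 1) ** 2
    ka = np.sum(Kx * Ky) / (np.linalg.norm(Kx) * np.linalg.norm(Ky))
    cka = np.sum(Kxc * Kyc) / (np.linalg.norm(Kxc) * np.linalg.norm(Kyc))
    return hsic, ka, cka
```

Note that `np.linalg.norm` on a matrix defaults to the Frobenius norm, so KA and CKA are both bounded by 1 via Cauchy-Schwarz.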
2. Which Estimator should we use?
Example Estimators
Notice: neither of these estimators takes the data size into account.
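For contrast, classical rules of thumb such as Scott's and Silverman's do scale with the sample size n. A hedged sketch of the common multivariate forms, applied per feature (exact constants vary across references, and the function names are mine):

```python
import numpy as np

def scott_sigma(X):
    # Scott's rule: sigma_d * n^(-1/(d+4)) per feature
    n, d = X.shape
    return X.std(axis=0, ddof=1) * n ** (-1.0 / (d + 4))

def silverman_sigma(X):
    # Silverman's rule: (4/(d+2))^(1/(d+4)) * sigma_d * n^(-1/(d+4)) per feature
    n, d = X.shape
    factor = (4.0 / (d + 2)) ** (1.0 / (d + 4))
    return factor * X.std(axis=0, ddof=1) * n ** (-1.0 / (d + 4))
```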
The heuristic is the median distance between the points of the domain. The full formula is:

\sigma = \sqrt{H_n / 2}

where H_n = \text{Med}\left\{ ||X_{n,i} - X_{n,j}||^2 \,|\, 1 \leq i < j \leq n \right\} and \text{Med} is the empirical median (the mean can be used as well). We can obtain this by:
- Calculating the squareform Euclidean distances between all points in our dataset
- Ordering them in increasing order
- Setting H_n to the central element if n(n-1)/2 is odd, or the mean of the two central elements if n(n-1)/2 is even
Note: some authors just use \sqrt{H_n}.
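The steps above can be sketched directly with SciPy's condensed distance vector (the function name is mine):

```python
import numpy as np
from scipy.spatial.distance import pdist

def sigma_median_heuristic(X, reducer=np.median):
    # condensed vector of squared Euclidean distances for all pairs i < j
    d2 = pdist(X, "sqeuclidean")
    Hn = reducer(d2)           # np.median gives H_n; pass np.mean for the variant
    return np.sqrt(Hn / 2.0)   # some authors use np.sqrt(Hn) instead
```

Conveniently, `np.median` of an even-length array already returns the mean of the two central elements, so the odd/even case split is handled for free.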
This distance measure was new to me: it is the median (or mean) of the distances from each point to its k-th nearest neighbour. Essentially, we proceed as with the median heuristic, except we take the k-th distance for each data point and then take the median of those.
- Calculate the squareform of the distance matrix
- Sort each row in ascending order
- Take the k-th column (the distance from each point to its k-th neighbour)
- Take the median or mean of this column
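A sketch of the k-th-neighbour version, following the four steps above (the function name and default k are mine):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def sigma_kth_neighbour(X, k=5, reducer=np.median):
    D = squareform(pdist(X, "euclidean"))  # full n x n distance matrix
    D_sorted = np.sort(D, axis=1)          # sort each row ascending; column 0 is the self-distance (0)
    kth = D_sorted[:, k]                   # distance from each point to its k-th nearest neighbour
    return reducer(kth)                    # median (or mean) over the dataset
```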
3. Should we use different length scales or the same?
Details
We can estimate one sigma per dataset (\sigma_X, \sigma_Y) or just one shared sigma (\sigma_{XY}).
We can also estimate one sigma per dataset per dimension (\sigma_{X_d}, \sigma_{Y_d}), which corresponds to an ARD kernel.
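A sketch of the per-dimension option: estimate one median-heuristic length scale per feature and plug the resulting vector into an ARD-style RBF kernel (both function names are mine):

```python
import numpy as np
from scipy.spatial.distance import pdist

def sigma_per_dimension(X, reducer=np.median):
    # one median-heuristic sigma per feature, i.e. sigma_{X_d}
    return np.array([np.sqrt(reducer(pdist(X[:, [d]], "sqeuclidean")) / 2.0)
                     for d in range(X.shape[1])])

def ard_rbf_kernel(X, sigmas):
    # ARD RBF kernel: each dimension scaled by its own length scale
    Xs = X / sigmas
    d2 = np.sum((Xs[:, None, :] - Xs[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2)
```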
4. Should we standardize our data?
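Standardization changes all pairwise distances, so it interacts with every sigma estimator above. A minimal sketch of the usual zero-mean, unit-variance transform (the eps guard for constant features is my addition):

```python
import numpy as np

def standardize(X, eps=1e-12):
    # zero mean, unit variance per feature
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)
```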
5. Summary of Parameters

| Parameter | Options |
|---|---|
| Standardize | Yes / No |
| Parameter Estimator | Mean, Median, Silverman, etc. |
| Center Kernel | Yes / No |
| Normalized Score | Yes / No |
| Kernel | RBF / ARD |
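The table above defines a small Cartesian product of options; a sketch of how the full parameter grid could be enumerated (the key names and option spellings are mine):

```python
from itertools import product

param_grid = {
    "standardize": [True, False],
    "estimator": ["mean", "median", "silverman"],
    "center_kernel": [True, False],
    "normalized": [True, False],
    "kernel": ["rbf", "ard"],
}

# one dict per configuration: 2 * 3 * 2 * 2 * 2 = 48 in total
configs = [dict(zip(param_grid, vals)) for vals in product(*param_grid.values())]
```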
Experiments
Walk-Throughs
These walk-throughs go step by step through how I implemented everything. They are mainly for code-review purposes, but they also contain some useful nuggets.
- 1.0 - Estimating Sigma
In this notebook, I show how one can estimate sigma using different heuristics in the literature.
- 2.0 - Estimating HSIC
I show how we can estimate HSIC using some of the main methods in the literature.
- 3.0 - Multivariate Distribution
I show how we can apply this to large multivariate data and create a large-scale parameter search.
- 4.1 - Best Parameters
This is part I, where I show some preliminary results for which methods work best for the Gaussian distribution.
- 4.2 - Best Parameters
This is part II, where I show some preliminary results for which methods work best for the T-Student distribution.
- 5.0 - Fitting Mutual Information
I show how the centered kernel alignment best approximates the Gaussian distribution.
Parameter Grid - 1D Data
Parameter Grid - nD Data
Demo Notebook
Mutual Information vs HSIC scores
Demo Notebook
Results
Take-Home Message I

The median distance heuristic seems fairly robust across different sample sizes and dimensions. Scott and Silverman should probably be avoided if you are not going to estimate the parameter per feature.
Take-Home Message II

The centered kernel alignment (CKA) method appears to be the most consistent when we compare the score against the mutual information of known distributions. HSIC is somewhat consistent, but not entirely so. The KA algorithm shows no consistency whatsoever; avoid this method for unsupervised problems.