Skip to content
Source

Motivation for this Study

In this document, I will be looking at the motivation behind this study and why we would like to pursue this further.

Code Blocks
import sys, os
import warnings
import tqdm
import random
import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt

# Insert path to model directory,.
cwd = os.getcwd()
path = f"{cwd}/../../src"
sys.path.insert(0, path)

# Insert path to package,.
pysim_path = f"/home/emmanuel/code/pysim/"
sys.path.insert(0, pysim_path)

# toy datasets
from data.toy import generate_dependence_data
from data.distribution import DataParams
from dataclasses import dataclass

# Kernel Dependency measure
from sklearn.preprocessing import StandardScaler
from sklearn.gaussian_process.kernels import RBF
from models.dependence import HSICModel
from pysim.kernel.utils import get_init_gammas, get_gamma_grid, estimate_sigma

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use(['seaborn-talk'])
%matplotlib inline

warnings.filterwarnings('ignore') # get rid of annoying warnings
FIG_PATH = "/home/emmanuel/projects/2019_hsic_align/results/figures/1d_dataset/demo/"
%load_ext autoreload
%autoreload 2
def standardize_data(X, Y, standardize: bool=False):
    X = StandardScaler().fit_transform(X)
    Y = StandardScaler().fit_transform(Y)
    return X, Y

def get_sigma(X, Y, method: str='silverman', per_dimension: bool=False, separate_scales: bool=False):
    # sigma parameters
    subsample = None
    percent = 0.20
    random_state = 123

    sigma_X = estimate_sigma(
        X, 
        subsample=subsample,
        method=method,
        percent=percent,
        random_state=random_state,
        per_dimension=per_dimension
    )

    sigma_Y = estimate_sigma(
        Y, 
        subsample=subsample,
        method=method,
        percent=percent,
        random_state=random_state,
        per_dimension=per_dimension
    )

    if separate_scales:
        sigma_X = np.mean([sigma_X, sigma_Y])
        sigma_Y = np.mean([sigma_X, sigma_Y])
    return sigma_X, sigma_Y

def get_hsic(X, Y, scorer: str, sigma_X=None, sigma_Y=None):

    # init hsic model class

    hsic_model = HSICModel()
    # hsic model params
    if sigma_X is not None:

        hsic_model.kernel_X = RBF(sigma_X)
        hsic_model.kernel_Y = RBF(sigma_Y)

    # get hsic score
    hsic_val = hsic_model.get_score(X, Y, scorer)

    return hsic_val

1D Example

Code Block
# data params
dataset = 'sine'
num_points = 1000
seed = 123
noise = 0.1

# get dataset
X, Y = generate_dependence_data(
    dataset=dataset,
    num_points=num_points,
    seed=seed,
    noise_x=noise,
    noise_y=noise
)

# plot
fig, ax = plt.subplots()

ax.scatter(X[:100,:], Y[:100,:])
plt.tight_layout()
fig.savefig(FIG_PATH + f"demo_{dataset}.png")
plt.show()

png

Fig I: An example 1D Sine Curve.

Let's take a simple 1D distribution: a sine curve. It is clear that there is a nonlinear relationship between them that cannot be captured (well) by linear methods. We are interested in looking at the dependence between X and Y. We have the HSIC family of methods: HSIC, kernel alignment and centered kernel alignment. They are all very similar but there are some subtle differences. We will highlight them as we go through the overview. Let's take a generic approach and use the default HSIC, KA and CKA methods to try and estimate the dependence between X,Y. If we run the algorithm, we get the following results.

Question I - Which Algorithm?

Code Block
results_df = pd.DataFrame()
method = 'scott'
per_dimension = False
separate_scales = False


# sigma_X, sigma_y = get_sigma(
#     X, Y, 
#     method=method, 
#     per_dimension=per_dimension, 
#     separate_scales=separate_scales
# )
method = 'default'
sigma_X, sigma_Y = None, None
scorer = 'hsic'

results_df = results_df.append(pd.DataFrame({
    "hsic": [get_hsic(X, Y, 'hsic', sigma_X, sigma_Y)],   # Estimate HSIC
    "ka": [get_hsic(X, Y, 'ka', sigma_X, sigma_Y)],   # Estimate KA
    "cka": [get_hsic(X, Y, 'cka', sigma_X, sigma_Y)],   # Estimate CKA
},index=['Q1']),)
print(results_df.to_markdown())

hsic ka cka
Q1 0.0582356 0.688475 0.588434

Notice how all of the values are slightly difference. This is because of the composition of the methods. We can highlight the differences with a simple table.

Method Centered Kernel Normalized
HSIC Yes No
Kernel Alignment No Yes
Centered Kernel Alignment Yes No

So each method has a slightly different formulation but they are mostly the same. So now the next question is: how do we estimate the parameters of the kernel used? Well the default is simply \sigma=1.0 but we know that this won't do as the kernel depends on the parameters of the kernel. In this case we are using the most commonly used kernel: the Radial Basis Function (RBF). Since this is a 1D example, I will use some generic estimators called the "Silverman Rule" and "Scott Rule". These are very commonly found in packages like scipy.stats.gaussian_kde or statsmodels.nonparametric.bandwidth. They are mostly used for the Kernel Density Estimation (KDE) where we need a decent parameter to approximate the kernel to get a decent density estimate.

So what happens with the methods and the results?

Question II - Which Parameter Estimator?

Code Block
methods = ['scott', 'silverman', 'median']
per_dimension = False
separate_scales = True
results_df = pd.DataFrame()

for imethod in methods:
    sigma_X, sigma_Y = get_sigma(
        X, Y, 
        method=imethod, 
        per_dimension=per_dimension, 
        separate_scales=separate_scales
    )
    results_df = results_df.append(pd.DataFrame({
#         "sigma_x": [sigma_X],
#         "sigma_y": [sigma_Y],
        'Estimator': [imethod],  
        "hsic": [get_hsic(X, Y, 'hsic', sigma_X, sigma_Y)],   # Estimate HSIC
        "ka": [get_hsic(X, Y, 'ka', sigma_X, sigma_Y)],   # Estimate KA
        "cka": [get_hsic(X, Y, 'cka', sigma_X, sigma_Y)],   # Estimate CKA
    },index=['Q2']),)

print(results_df.to_markdown())
Estimator hsic ka cka
Q2 scott 0.0575482 0.660478 0.530685
Q2 silverman 0.0515751 0.6345 0.515583
Q2 median 0.066173 0.702005 0.556274

Question III - How do we estimate the length scale?

  • Use the same length scale?
  • Use different length scales?
  • Use a length scale per dimension (D>1)
Code Block
methods = ['scott', 'silverman', 'median']
per_dimension = False
separate_scales = [True, False]
results_df = pd.DataFrame()

for iscaler in separate_scales:
    for imethod in methods:
        sigma_X, sigma_Y = get_sigma(
            X, Y, 
            method=imethod, 
            per_dimension=per_dimension, 
            separate_scales=iscaler
        )
        results_df = results_df.append(pd.DataFrame({
    #         "sigma_x": [sigma_X],
            "separate": [iscaler],
            'Estimator': [imethod],  
            "hsic": [get_hsic(X, Y, 'hsic', sigma_X, sigma_Y)],   # Estimate HSIC
            "ka": [get_hsic(X, Y, 'ka', sigma_X, sigma_Y)],   # Estimate KA
            "cka": [get_hsic(X, Y, 'cka', sigma_X, sigma_Y)],   # Estimate CKA
        },index=['Q3']),)

print(results_df.to_markdown())
separate Estimator hsic ka cka
Q3 True scott 0.0575482 0.660478 0.530685
Q3 True silverman 0.0515751 0.6345 0.515583
Q3 True median 0.066173 0.702005 0.556274
Q3 False scott 0.0601095 0.696988 0.596866
Q3 False silverman 0.0524045 0.66827 0.577468
Q3 False median 0.0728568 0.739607 0.620757

Question IV - Standardize Data?

We could also standardize our data... This could actually change the size of each of the features which could eliminate the need to apply separate length scales.

Code Block
standardize = [True, False]
methods = ['scott', 'silverman', 'median']
per_dimension = False
separate_scales = [True, False]
results_df = pd.DataFrame()

for istandard in standardize:

    X_, Y_ = standardize_data(X, Y, istandard)
    for iscaler in separate_scales:
        for imethod in methods:
            sigma_X, sigma_Y = get_sigma(
                X_, Y_, 
                method=imethod, 
                per_dimension=per_dimension, 
                separate_scales=iscaler
            )
            results_df = results_df.append(pd.DataFrame({
                "standardize": [istandard],
                "separate": [iscaler],
                'Estimator': [imethod],  
                "hsic": [get_hsic(X_, Y_, 'hsic', sigma_X, sigma_Y)],   # Estimate HSIC
                "ka": [get_hsic(X_, Y_, 'ka', sigma_X, sigma_Y)],   # Estimate KA
                "cka": [get_hsic(X_, Y_, 'cka', sigma_X, sigma_Y)],   # Estimate CKA
            },index=['Q4']),)

print(results_df.to_markdown())
standardize separate Estimator hsic ka cka
Q4 True True scott 0.0601095 0.696988 0.596866
Q4 True True silverman 0.0524045 0.66827 0.577468
Q4 True True median 0.0729923 0.74078 0.623443
Q4 True False scott 0.0601095 0.696988 0.596866
Q4 True False silverman 0.0524045 0.66827 0.577468
Q4 True False median 0.0728568 0.739607 0.620757
Q4 False True scott 0.0601095 0.696988 0.596866
Q4 False True silverman 0.0524045 0.66827 0.577468
Q4 False True median 0.0729923 0.74078 0.623443
Q4 False False scott 0.0601095 0.696988 0.596866
Q4 False False silverman 0.0524045 0.66827 0.577468
Q4 False False median 0.0728568 0.739607 0.620757

Now we see that the values you get are quite different for all methods. What happens if we use different sigmas?

Todo

Show a plot of the different parameters and how much they vary. No need to see the actual origins. Just need to highlight the variance in the estimates.

Verdict

Well, hard to say as it depends on the parameters. Every researcher I've met who dealt with kernel methods seems to have a suggestion that they swear by but I never know who to follow. My thoughts is that we should use dedicated sigma values per dataset however, that still leaves us with other methods that we may want to try. So we're going to repeat the same experiment but with a 2D dataset and we will see that the difficult will increase again.


2D Example

For this experiment, we're going to take two 2D datasets each generated from a T-Student distribution. We will apply the same sequence as we did above and we will end the section by adding another option for picking the parameters.

Code Block
# initialize Data Params class
dataset = 'tstudent'
samples = 1_000
dimensions = 2
std = 5
nu = 8
trial = 1
standardize = False

# initialize params
example_params = DataParams(
    dataset=dataset,
    samples=samples,
    dimensions=dimensions,
    std=std,
    nu=nu,
    trial=trial,
    standardize=standardize
)

# generate some parameters
inputs = example_params.generate_data()
sns.jointplot(x=inputs.X, y=inputs.Y)
plt.tight_layout()
plt.savefig(FIG_PATH + f"demo_{dataset}.png")

png

Fig II: An example 2D T-Student distribution.

Question III (Revisited) - Different Length Scales?

Now we can revisit this question because we actually could estimate a different length scale depending upon the dimensionality. One problem with scott or Silverman's method is that it takes into account the entire dataset instead of having one estimate per feature.

Code Block
methods = ['scott', 'silverman', 'median']
per_dimension = False
separate_scales = [True, False]
separate_dimensions = [True, False]
results_df = pd.DataFrame()

for iscaler in separate_scales:
    for idim in separate_dimensions:
        for imethod in methods:
            sigma_X, sigma_Y = get_sigma(
                X, Y, 
                method=imethod, 
                per_dimension=idim, 
                separate_scales=iscaler
            )
            results_df = results_df.append(pd.DataFrame({
                "standardize": [istandard],
                "Separate Dimensions": [idim],
                "Separate Length Scales": [iscaler],
                'Param Estimator': [imethod],  
                "HSIC": [get_hsic(X, Y, 'hsic', sigma_X, sigma_Y)],   # Estimate HSIC
                "KA": [get_hsic(X, Y, 'ka', sigma_X, sigma_Y)],   # Estimate KA
                "CKA": [get_hsic(X, Y, 'cka', sigma_X, sigma_Y)],   # Estimate CKA
            },index=['Q3']),)

print(results_df.to_markdown())
standardize Separate Dimensions Separate Length Scales Param Estimator HSIC KA CKA
Q3 False True True scott 0.0575482 0.660478 0.530685
Q3 False True True silverman 0.0515751 0.6345 0.515583
Q3 False True True median 0.066173 0.702005 0.556274
Q3 False False True scott 0.0575482 0.660478 0.530685
Q3 False False True silverman 0.0515751 0.6345 0.515583
Q3 False False True median 0.066173 0.702005 0.556274
Q3 False True False scott 0.0601095 0.696988 0.596866
Q3 False True False silverman 0.0524045 0.66827 0.577468
Q3 False True False median 0.0728568 0.739607 0.620757
Q3 False False False scott 0.0601095 0.696988 0.596866
Q3 False False False silverman 0.0524045 0.66827 0.577468
Q3 False False False median 0.0728568 0.739607 0.620757

Q1-Q4

So now, let's look at all questions for the 2D data distribution

Code Block
standardize = [True, False]
methods = ['scott', 'silverman', 'median']
per_dimension = False
separate_scales = [True, False]
separate_dimensions = [True, False]
results_df = pd.DataFrame()

for istandard in standardize:

    X_, Y_ = standardize_data(X, Y, istandard)
    for iscaler in separate_scales:
        for idim in separate_dimensions:
            for imethod in methods:
                sigma_X, sigma_Y = get_sigma(
                    X_, Y_, 
                    method=imethod, 
                    per_dimension=idim, 
                    separate_scales=iscaler
                )
                results_df = results_df.append(pd.DataFrame({
                    "standardize": [istandard],
                    "Separate Dimensions": [idim],
                    "Separate Length Scales": [iscaler],
                    'Param Estimator': [imethod],  
                    "HSIC": [get_hsic(X_, Y_, 'hsic', sigma_X, sigma_Y)],   # Estimate HSIC
                    "KA": [get_hsic(X_, Y_, 'ka', sigma_X, sigma_Y)],   # Estimate KA
                    "CKA": [get_hsic(X_, Y_, 'cka', sigma_X, sigma_Y)],   # Estimate CKA
                },index=['Q4']),)

print(results_df.to_markdown())
standardize Separate Dimensions Separate Length Scales Param Estimator HSIC KA CKA
Q4 True True True scott 0.0601095 0.696988 0.596866
Q4 True True True silverman 0.0524045 0.66827 0.577468
Q4 True True True median 0.0729923 0.74078 0.623443
Q4 True False True scott 0.0601095 0.696988 0.596866
Q4 True False True silverman 0.0524045 0.66827 0.577468
Q4 True False True median 0.0729923 0.74078 0.623443
Q4 True True False scott 0.0601095 0.696988 0.596866
Q4 True True False silverman 0.0524045 0.66827 0.577468
Q4 True True False median 0.0728568 0.739607 0.620757
Q4 True False False scott 0.0601095 0.696988 0.596866
Q4 True False False silverman 0.0524045 0.66827 0.577468
Q4 True False False median 0.0728568 0.739607 0.620757
Q4 False True True scott 0.0601095 0.696988 0.596866
Q4 False True True silverman 0.0524045 0.66827 0.577468
Q4 False True True median 0.0729923 0.74078 0.623443
Q4 False False True scott 0.0601095 0.696988 0.596866
Q4 False False True silverman 0.0524045 0.66827 0.577468
Q4 False False True median 0.0729923 0.74078 0.623443
Q4 False True False scott 0.0601095 0.696988 0.596866
Q4 False True False silverman 0.0524045 0.66827 0.577468
Q4 False True False median 0.0728568 0.739607 0.620757
Q4 False False False scott 0.0601095 0.696988 0.596866
Q4 False False False silverman 0.0524045 0.66827 0.577468
Q4 False False False median 0.0728568 0.739607 0.620757

Verdict

For the distributions, it seemed to be a little more consistent but with higher dimensions and more samples, these estimators start to fail. But then, we still don't have good alternative estimators.

Todo

Show a plot of the different parameters and how much they vary. No need to see the actual origins. Just need to highlight the variance in the estimates.

What Now?

I will be looking at the following:

Options
Standardize Yes / No
Parameter Estimator Mean, Median, Silverman, etc
Center Kernel Yes / No
Normalized Score Yes / No