Demo I - Different Cases¶

In this document, I will be looking at the motivation behind this study and why we would like to pursue this further.

import sys, os
import warnings
import tqdm
import random
import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt

# Insert path to model directory,.
cwd = os.getcwd()
path = f"{cwd}/../../src"
sys.path.insert(0, path)

# Insert path to package,.
pysim_path = f"/home/emmanuel/code/pysim/"
sys.path.insert(0, pysim_path)

# toy datasets
from data.toy import generate_dependence_data
from data.distribution import DataParams
from dataclasses import dataclass

# Kernel Dependency measure
from sklearn.preprocessing import StandardScaler
from sklearn.gaussian_process.kernels import RBF
from models.dependence import HSICModel
from pysim.kernel.utils import get_init_gammas, get_gamma_grid, estimate_sigma

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use(['seaborn-talk'])
%matplotlib inline

warnings.filterwarnings('ignore') # get rid of annoying warnings

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

FIG_PATH = "/home/emmanuel/projects/2019_hsic_align/results/figures/1d_dataset/demo/"

Estimating Sigma & HSIC¶

def standardize_data(X, Y, standardize: bool=False):
    X = StandardScaler().fit_transform(X)
    Y = StandardScaler().fit_transform(Y)
    return X, Y

def get_sigma(X, Y, method: str='silverman', per_dimension: bool=False, separate_scales: bool=False):
    # sigma parameters
    subsample = None
    percent = 0.20
    random_state = 123

    sigma_X = estimate_sigma(
        X, 
        subsample=subsample,
        method=method,
        percent=percent,
        random_state=random_state,
        per_dimension=per_dimension
    )

    sigma_Y = estimate_sigma(
        Y, 
        subsample=subsample,
        method=method,
        percent=percent,
        random_state=random_state,
        per_dimension=per_dimension
    )

    if separate_scales:
        sigma_X = np.mean([sigma_X, sigma_Y])
        sigma_Y = np.mean([sigma_X, sigma_Y])
    return sigma_X, sigma_Y

def get_hsic(X, Y, scorer: str, sigma_X=None, sigma_Y=None):

    # init hsic model class

    hsic_model = HSICModel()
    # hsic model params
    if sigma_X is not None:

        hsic_model.kernel_X = RBF(sigma_X)
        hsic_model.kernel_Y = RBF(sigma_Y)

    # get hsic score
    hsic_val = hsic_model.get_score(X, Y, scorer)

    return hsic_val

Data I - 1D Dataset¶

# data params
dataset = 'sine'
num_points = 1000
seed = 123
noise = 0.1

# get dataset
X, Y = generate_dependence_data(
    dataset=dataset,
    num_points=num_points,
    seed=seed,
    noise_x=noise,
    noise_y=noise
)

# plot
fig, ax = plt.subplots()

ax.scatter(X[:100,:], Y[:100,:])
plt.tight_layout()
fig.savefig(FIG_PATH + f"demo_{dataset}.png")
plt.show()

Let's take a simple 1D distribution: a sine curve. It is clear that there is a nonlinear relationship between them that cannot be captured (well) by linear methods. We are interested in looking at the dependence between

$X$ and

$Y$ . We have the HSIC family of methods: HSIC, kernel alignment and centered kernel alignment. They are all very similar but there are some subtle differences. We will highlight them as we go through the overview. Let's take a generic approach and use the default HSIC, KA and CKA methods to try and estimate the dependence between

$X,Y$ . If we run the algorithm, we get the following results.

Question I - Which Algorithm?¶

results_df = pd.DataFrame()

method = 'scott'
per_dimension = False
separate_scales = False


# sigma_X, sigma_y = get_sigma(
#     X, Y, 
#     method=method, 
#     per_dimension=per_dimension, 
#     separate_scales=separate_scales
# )
method = 'default'
sigma_X, sigma_Y = None, None
scorer = 'hsic'

results_df = results_df.append(pd.DataFrame({
    "hsic": [get_hsic(X, Y, 'hsic', sigma_X, sigma_Y)],   # Estimate HSIC
    "ka": [get_hsic(X, Y, 'ka', sigma_X, sigma_Y)],   # Estimate KA
    "cka": [get_hsic(X, Y, 'cka', sigma_X, sigma_Y)],   # Estimate CKA
},index=['Q1']),)
print(results_df.to_markdown())

|    |      hsic |       ka |      cka |
|:---|----------:|---------:|---------:|
| Q1 | 0.0582356 | 0.688475 | 0.588434 |

Notice how all of the values are slightly difference. This is because of the composition of the methods. We can highlight the differences with a simple table.

| **Method** | **Centered Kernel** | **Normalized** | | ------------------------- | ------------------- | -------------- | | HSIC | Yes | No | | Kernel Alignment | No | Yes | | Centered Kernel Alignment | Yes | No |

So each method has a slightly different formulation but they are mostly the same. So now the next question is: how do we estimate the parameters of the kernel used? Well the default is simply $\sigma=1.0$ but we know that this won't do as the kernel depends on the parameters of the kernel. In this case we are using the most commonly used kernel: the Radial Basis Function (RBF). Since this is a 1D example, I will use some generic estimators called the "Silverman Rule" and "Scott Rule". These are very commonly found in packages like scipy.stats.gaussian_kde or statsmodels.nonparametric.bandwidth. They are mostly used for the Kernel Density Estimation (KDE) where we need a decent parameter to approximate the kernel to get a decent density estimate.

So what happens with the methods and the results?

Question II - Which Parameter Estimator?¶

methods = ['scott', 'silverman', 'median']
per_dimension = False
separate_scales = True
results_df = pd.DataFrame()

for imethod in methods:
    sigma_X, sigma_Y = get_sigma(
        X, Y, 
        method=imethod, 
        per_dimension=per_dimension, 
        separate_scales=separate_scales
    )
    results_df = results_df.append(pd.DataFrame({
#         "sigma_x": [sigma_X],
#         "sigma_y": [sigma_Y],
        'Estimator': [imethod],  
        "hsic": [get_hsic(X, Y, 'hsic', sigma_X, sigma_Y)],   # Estimate HSIC
        "ka": [get_hsic(X, Y, 'ka', sigma_X, sigma_Y)],   # Estimate KA
        "cka": [get_hsic(X, Y, 'cka', sigma_X, sigma_Y)],   # Estimate CKA
    },index=['Q2']),)

print(results_df.to_markdown())

|    | Estimator   |      hsic |       ka |      cka |
|:---|:------------|----------:|---------:|---------:|
| Q2 | scott       | 0.0575482 | 0.660478 | 0.530685 |
| Q2 | silverman   | 0.0515751 | 0.6345   | 0.515583 |
| Q2 | median      | 0.066173  | 0.702005 | 0.556274 |

Question III - How do we estimate the length scale?¶

Use the same length scale?
Use different length scales?
Use a length scale per dimension (D>1)

methods = ['scott', 'silverman', 'median']
per_dimension = False
separate_scales = [True, False]
results_df = pd.DataFrame()

for iscaler in separate_scales:
    for imethod in methods:
        sigma_X, sigma_Y = get_sigma(
            X, Y, 
            method=imethod, 
            per_dimension=per_dimension, 
            separate_scales=iscaler
        )
        results_df = results_df.append(pd.DataFrame({
    #         "sigma_x": [sigma_X],
            "separate": [iscaler],
            'Estimator': [imethod],  
            "hsic": [get_hsic(X, Y, 'hsic', sigma_X, sigma_Y)],   # Estimate HSIC
            "ka": [get_hsic(X, Y, 'ka', sigma_X, sigma_Y)],   # Estimate KA
            "cka": [get_hsic(X, Y, 'cka', sigma_X, sigma_Y)],   # Estimate CKA
        },index=['Q3']),)

print(results_df.to_markdown())

|    | separate   | Estimator   |      hsic |       ka |      cka |
|:---|:-----------|:------------|----------:|---------:|---------:|
| Q3 | True       | scott       | 0.0575482 | 0.660478 | 0.530685 |
| Q3 | True       | silverman   | 0.0515751 | 0.6345   | 0.515583 |
| Q3 | True       | median      | 0.066173  | 0.702005 | 0.556274 |
| Q3 | False      | scott       | 0.0601095 | 0.696988 | 0.596866 |
| Q3 | False      | silverman   | 0.0524045 | 0.66827  | 0.577468 |
| Q3 | False      | median      | 0.0728568 | 0.739607 | 0.620757 |

Question IV - Standardize Data?¶

We could also standardize our data... This could actually change the size of each of the features which could eliminate the need to apply separate length scales.

standardize = [True, False]
methods = ['scott', 'silverman', 'median']
per_dimension = False
separate_scales = [True, False]
results_df = pd.DataFrame()

for istandard in standardize:

    X_, Y_ = standardize_data(X, Y, istandard)
    for iscaler in separate_scales:
        for imethod in methods:
            sigma_X, sigma_Y = get_sigma(
                X_, Y_, 
                method=imethod, 
                per_dimension=per_dimension, 
                separate_scales=iscaler
            )
            results_df = results_df.append(pd.DataFrame({
                "standardize": [istandard],
                "separate": [iscaler],
                'Estimator': [imethod],  
                "hsic": [get_hsic(X_, Y_, 'hsic', sigma_X, sigma_Y)],   # Estimate HSIC
                "ka": [get_hsic(X_, Y_, 'ka', sigma_X, sigma_Y)],   # Estimate KA
                "cka": [get_hsic(X_, Y_, 'cka', sigma_X, sigma_Y)],   # Estimate CKA
            },index=['Q4']),)

print(results_df.to_markdown())

|    | standardize   | separate   | Estimator   |      hsic |       ka |      cka |
|:---|:--------------|:-----------|:------------|----------:|---------:|---------:|
| Q4 | True          | True       | scott       | 0.0601095 | 0.696988 | 0.596866 |
| Q4 | True          | True       | silverman   | 0.0524045 | 0.66827  | 0.577468 |
| Q4 | True          | True       | median      | 0.0729923 | 0.74078  | 0.623443 |
| Q4 | True          | False      | scott       | 0.0601095 | 0.696988 | 0.596866 |
| Q4 | True          | False      | silverman   | 0.0524045 | 0.66827  | 0.577468 |
| Q4 | True          | False      | median      | 0.0728568 | 0.739607 | 0.620757 |
| Q4 | False         | True       | scott       | 0.0601095 | 0.696988 | 0.596866 |
| Q4 | False         | True       | silverman   | 0.0524045 | 0.66827  | 0.577468 |
| Q4 | False         | True       | median      | 0.0729923 | 0.74078  | 0.623443 |
| Q4 | False         | False      | scott       | 0.0601095 | 0.696988 | 0.596866 |
| Q4 | False         | False      | silverman   | 0.0524045 | 0.66827  | 0.577468 |
| Q4 | False         | False      | median      | 0.0728568 | 0.739607 | 0.620757 |

Now we see that the values you get are quite different for all methods. What happens if we use different sigmas?

Verdict¶

Well, hard to say as it depends on the parameters. Every researcher I've met who dealt with kernel methods seems to have a suggestion that they swear by but I never know who to follow. My thoughts is that we should use dedicated sigma values per dataset however, that still leaves us with other methods that we may want to try. So we're going to repeat the same experiment but with a 2D dataset and we will see that the difficult will increase again.

2D Example¶

For this experiment, we're going to take two 2D datasets each generated from a T-Student distribution. We will apply the same sequence as we did above and we will end the section by adding another option for picking the parameters.

# initialize Data Params class
dataset = 'tstudent'
samples = 1_000
dimensions = 2
std = 5
nu = 8
trial = 1
standardize = False

# initialize params
example_params = DataParams(
    dataset=dataset,
    samples=samples,
    dimensions=dimensions,
    std=std,
    nu=nu,
    trial=trial,
    standardize=standardize
)

# generate some parameters
inputs = example_params.generate_data()

sns.jointplot(x=inputs.X, y=inputs.Y)
plt.tight_layout()
plt.savefig(FIG_PATH + f"demo_{dataset}.png")

Fig I: An example 2D T-Student distribution.

Question III (Revisited) - Different Length Scales?¶

Now we can revisit this question because we actually could estimate a different length scale depending upon the dimensionality. One problem with scott or Silverman's method is that it takes into account the entire dataset instead of having one estimate per feature.

methods = ['scott', 'silverman', 'median']
per_dimension = False
separate_scales = [True, False]
separate_dimensions = [True, False]
results_df = pd.DataFrame()

for iscaler in separate_scales:
    for idim in separate_dimensions:
        for imethod in methods:
            sigma_X, sigma_Y = get_sigma(
                X, Y, 
                method=imethod, 
                per_dimension=idim, 
                separate_scales=iscaler
            )
            results_df = results_df.append(pd.DataFrame({
                "standardize": [istandard],
                "Separate Dimensions": [idim],
                "Separate Length Scales": [iscaler],
                'Param Estimator': [imethod],  
                "HSIC": [get_hsic(X, Y, 'hsic', sigma_X, sigma_Y)],   # Estimate HSIC
                "KA": [get_hsic(X, Y, 'ka', sigma_X, sigma_Y)],   # Estimate KA
                "CKA": [get_hsic(X, Y, 'cka', sigma_X, sigma_Y)],   # Estimate CKA
            },index=['Q3']),)

print(results_df.to_markdown())

Q1-Q4¶

So now, let's look at all questions for the 2D data distribution

standardize = [True, False]
methods = ['scott', 'silverman', 'median']
per_dimension = False
separate_scales = [True, False]
separate_dimensions = [True, False]
results_df = pd.DataFrame()

for istandard in standardize:

    X_, Y_ = standardize_data(X, Y, istandard)
    for iscaler in separate_scales:
        for idim in separate_dimensions:
            for imethod in methods:
                sigma_X, sigma_Y = get_sigma(
                    X_, Y_, 
                    method=imethod, 
                    per_dimension=idim, 
                    separate_scales=iscaler
                )
                results_df = results_df.append(pd.DataFrame({
                    "standardize": [istandard],
                    "Separate Dimensions": [idim],
                    "Separate Length Scales": [iscaler],
                    'Param Estimator': [imethod],  
                    "HSIC": [get_hsic(X_, Y_, 'hsic', sigma_X, sigma_Y)],   # Estimate HSIC
                    "KA": [get_hsic(X_, Y_, 'ka', sigma_X, sigma_Y)],   # Estimate KA
                    "CKA": [get_hsic(X_, Y_, 'cka', sigma_X, sigma_Y)],   # Estimate CKA
                },index=['Q4']),)

print(results_df.to_markdown())

Verdict¶

For the distributions, it seemed to be a little more consistent but with higher dimensions and more samples, these estimators start to fail. But then, we still don't have good alternative estimators.

What Now?¶

I will be looking at the following:

	Options
Standardize	Yes / No
Parameter Estimator	Mean, Median, Silverman, etc
Center Kernel	Yes / No
Normalized Score	Yes / No