Comparing HSIC versus Mutual Information measures¶

In this notebook, we will be exploring how we can estimate the HSIC parmaeter for different distributions and look at how it compares to MI measures. Normally the procedure for calculating HSIC is as follows:

Calculate kernel matrices for X and Y
Center both kernel matrices
Find the Frobenius norm between the kernel matrices

This works well but there is no certain way to estimate the parameter for each of the kernel matrices. There is another paper that is called the Kernel Tangent Alignment (KTA). This method is different as it is calculated like so:

Calculate the kernel matrices for X and Y
Find the Frobenius norm between the kernel matrices
Normalize the value by the Frobenius norm of X and Y individually

This works in a similar way to the HSIC method. The difference is that you do the normalization procedure. The final algorithm is the Centered Kernel Tangent Alignment (cKTA) method which is a combination of the previous two methods. The algorithm is as follows:

Calculate the kernel matrices for X and Y
Center both kernel matrices
Find the Frobenius norm between the kernel matrices
Normalize the value by the Frobenius norm of X and Y individually

As you can see, it would appear that this method is the most complete in the sense that it incorporates all steps to ensure that our data is sufficiently scaled to a way that is directly comparable. This notebook will attempt to see which of these methods provides a good estimate for a sigma value for the kernel matrices and how do their output values compare to mutual information measures.

Covariance vs Correlation¶

The above methods can be put into perspective of the difference between covariance measures and correlation meausres. In this section, I will explain why that is so. We can write out the full definitions for covariance and correlation. The definition of covariance is:

$\text{cov}_{XY}=\sigma_{XY}=E\left[(X - \mu_X)(Y - \mu_Y) \right]$

The definition of correlation is:

$\text{corr}_{XY}=\rho_{XY}=\frac{\sigma_{XY}}{\sigma_X \sigma_Y}$

The correlation measure is dimensionless whereas the covariance is in units obtained by multiplying the units of the two variables.

Source:

Wikipedia on Covariance and correlation
Wikipedia on Correlation and dependence

Code¶

import sys, os
import warnings
import tqdm
import random
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# Insert path to model directory,.
cwd = os.getcwd()
path = f"{cwd}/../../src"
sys.path.insert(0, path)

# toy datasets
from data.toy import generate_dependence_data

# Kernel Dependency measure
from models.dependence import HSIC
from models.kernel import estimate_sigma, sigma_to_gamma, gamma_to_sigma, get_param_grid

# RBIG IT measures
from models.ite_algorithms import run_rbig_models

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

warnings.filterwarnings('ignore') # get rid of annoying warnings

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Experiment Class¶

The entire experiment is contained within this class. It features the fixed param, the free params, all the methods with the preprocessing and loading, and the hsic methods.

class ExperimentKTA:
    def __init__(
        self, 
        seed=123, 
        n_trials=10, 
        hsic_points=1000, 
        n_noise=10, 
        n_gamma=50, 
        factor=2,
        mi_points=100_000, 
        sigma_est='silverman',
        save_path=None,
        save_name='test'
    ):

        # fixed experimental params
        self.seed        = seed
        self.hsic_points = hsic_points
        self.n_noise     = n_noise
        self.n_gamma     = n_gamma
        self.factor      = factor
        self.mi_points   = mi_points
        self.sigma_est   = sigma_est
        self.n_trials    = n_trials
        self.save_path   = save_path
        self.save_name   = save_name

        # free experimental params
        self.noise_params = np.logspace(-3, -.3, n_noise)
        self.func_params = [
            'line',
            'sine',
            'circ',
            'rand',
        ]
        self.scorers = [
            'hsic',
            'tka',
            'ctka',
        ]
        self.seeds = [i for i in range(1, n_trials + 1)]

        # saved dataframe

        pass

    def run_experiment(self):

        # initialize results dataframe
        self.results_df = self.generate_results_df()

        # Loop through functions
        for ifunction in self.func_params:
            print(f"Function: {ifunction}")

            # Loop through noise parameters
            for inoise in self.noise_params:

                # Loop through random seeds
                for iseed in self.seeds:
                    # generate data for MI measure
                    X, Y = self._generate_data(
                        inoise, ifunction, iseed, dataset='mi'
                    )

                    # calculate MI
                    mi_score, _ = run_rbig_models(
                        X, Y,
                        measure='mi',
                        verbose=None,
                        random_state=self.seed
                    )

                    # =======================
                    # HSIC MEASURES
                    # =======================

                    # initialize sigma
                    init_sigma, sigma_params = self._estimate_sigmas(X, Y)

                    # convert to gamma
                    init_gamma = sigma_to_gamma(init_sigma)
                    gamma_params = sigma_to_gamma(sigma_params)


                    # Loop through HSIC scoring methods
                    for hsic_method in self.scorers:

                        # Loop through gamma parameters
                        for igamma in gamma_params:

                            # generate data for MI measure
                            X, Y = self._generate_data(
                                inoise, ifunction, iseed, dataset='hsic'
                            )

                            # Calculate HSIC
                            hsic_score = self._get_hsic(X, Y, igamma, None, hsic_method)

                            # append results to results dataframe
                            self.results_df = self.append_results(
                                self.results_df, 
                                ifunction, 
                                iseed, 
                                inoise, 
                                init_gamma, 
                                igamma, 
                                hsic_method, 
                                hsic_score, 
                                mi_score
                            )

                            # save results to csv
                            self.save_data(self.results_df)
        return self

    def _generate_data(self, noise, function, seed, dataset='hsic'):

        if dataset == 'hsic':
            num_points = self.hsic_points
        elif dataset == 'mi':
            num_points = self.mi_points
        else:
            raise ValueError(f'Unrecoginized dataset: {dataset}')

        # get dataset
        X, Y = generate_dependence_data(
            dataset=function,
            num_points=num_points,
            seed=seed,
            noise_x=noise,
            noise_y=noise
        )
        return X, Y

    def _estimate_sigmas(self, X, Y):

        # estimate initialize sigma
        sigma_x = estimate_sigma(X, method=self.sigma_est)
        sigma_y = estimate_sigma(Y, method=self.sigma_est)

        # init overall sigma is mean between two
        init_sigma = np.mean([sigma_x, sigma_y])

        # get parameter grid
        sigmas = get_param_grid(init_sigma, self.factor, self.n_gamma)

        # convert to gammas

        return init_sigma, sigmas

    def _get_hsic(self, X, Y, gamma, subsample, scorer):

        # initialize hsic
        clf_hsic = HSIC(gamma=gamma, scorer=scorer, subsample=subsample)

        # calculate HSIC value
        clf_hsic.fit(X, Y)

        # return HSIC score
        return clf_hsic.score(X)

    def generate_results_df(self):
        return pd.DataFrame(columns=[
            'trial',
            'function',
            'noise',
            'init_gamma',
            'gamma',
            'scorer',
            'value',
            'mi',
        ])

    def append_results(
        self, 
        results_df, 
        function, 
        trial, 
        noise, 
        init_gamma, 
        gamma, 
        hsic_method, 
        hsic_score, 
        mi_score
    ):  
        # append data 
        return results_df.append({
            'function': function,
            'trial': trial,
            'noise': noise,
            'init_gamma': init_gamma,
            'gamma': gamma,
            'scorer': hsic_method,
            'value': hsic_score,
            'mi': mi_score
        }, ignore_index=True)

    def load_data(self):
        pass

    def save_data(self, results_df):
        results_df.to_csv(f"{self.save_path}{self.save_name}.csv")

Experiment Run¶

# experimental params
seed        = 123                   # reproducibility
n_trials    = 1                     # number of trials
hsic_points = 10_000                # number of points used for HSIC
mi_points   = 10_000                # number of points used for MI
n_noise     = 50                    # number of points in noise param grid
n_gamma     = 50                    # number of points in gamma param grid
factor      = 1                     # log factor for gamma param grid bounds
sigma_est   = 'mean'                # sigma initialization
save_path   = f'{cwd}/../../results/hsic/'
save_name   = 'trial_v2_s10k'

# initialize experiment class
clf_exp = ExperimentKTA(
    seed=seed,
    n_trials=n_trials,
    hsic_points=hsic_points,
    factor=factor,
    mi_points=mi_points,
    sigma_est=sigma_est,
    n_noise=n_noise,
    n_gamma=n_gamma,
    save_path=save_path,
    save_name=save_name,
)

# run full experiment
clf_exp.run_experiment()

Function: line

projects/2019_hsic_align/results/hsic

Example¶

Visualize the Results¶

Figure I - Mutual Information & Noise¶

This first figure is to demonstrate how the mutual information compares with the amount of noise for each of the functions Linear, Sinusoidal, Circular, and Random.

res_noise = results_df['noise'].unique().tolist()
res_gammas = results_df['gamma'].unique().tolist()
res_lines = results_df['function'].unique().tolist()

fig, ax = plt.subplots()
ax.scatter(
    results_df[results_df['function'] == 'line']['noise'], 
    results_df[results_df['function'] == 'line']['mi'],
    label='Line'
)
plt.xscale('log')
plt.yscale('log')

ax.scatter(
    results_df[results_df['function'] == 'sine']['noise'], 
    results_df[results_df['function'] == 'sine']['mi'], 
    label='Sine'
)
plt.xscale('log')
plt.yscale('log')


ax.scatter(
    results_df[results_df['function'] == 'circ']['noise'], 
    results_df[results_df['function'] == 'circ']['mi'], 
    label='Circle'
)

plt.xscale('log')
plt.yscale('log')

ax.scatter(
    results_df[results_df['function'] == 'rand']['noise'], 
    results_df[results_df['function'] == 'rand']['mi'], 
    label='Random'
)

plt.xscale('log')
plt.yscale('log')

ax.set_xlabel('Noise, $\sigma_y$')
ax.set_ylabel('Mutual Information, MI$(X,Y)$')
# ax.set_xscale('log')
plt.legend()
ax.set_title('Experimental Parameter Space')
plt.show()

Line

This has the most amount of MI with little noise but log-linearly decreases as we increase the noise.

Sine

This has the 2^nd most amount of MI with little noise. It also decreases as wee increase the noise. Curiously, we find that it has more MI than the linear function at some point around $10^{-2}$ amount of noise. It could be due to the amount of structure still present within the sine function.

Circle

The circle has less MI overall than the linear and sine function and also decreases with added noise.

Random

There is no mutual information between them no matter how much more noise we inject.

Plot Function¶

def plot_res_gamma(results_df, function='line', hsic_method='hsic'):

    save_path = f'{cwd}/../../results/hsic/figures/'

    # Set Title stuff
    if function == 'line':
        title = 'Linear Function'
    elif function == 'sine':
        title = 'Sine Function'
    elif function == 'circ':
        title = 'Circle Function'
    elif function == 'rand':
        title = 'Random Function'
    else:
        raise ValueError(f'Unrecognized function: {line}')

    sub_results_df = results_df[results_df['function'] == function]
    free_params = [
    #     'gamma',
        'function'
    ]

    fixed_params = [
        'gamma',
        'value',
        'method',
        'mi'
    ]

    groups = sub_results_df.groupby(free_params)

    hue = 'gamma'

    fig, ax = plt.subplots(nrows=1, figsize=(7, 3))


    for iparams, idata in groups:

        # Plot I - HSIC
        pts = ax.scatter(
            x=idata[idata['scorer']== hsic_method]['value'],
            y=idata[idata['scorer']== hsic_method]['mi'],
            c=idata[idata['scorer']== hsic_method]['gamma'],
            s=20, cmap='Spectral',
            norm=matplotlib.colors.LogNorm()
        )
        ax.set_xlabel( hsic_method.upper() )
        ax.set_ylabel('Mutual Information')
        fig.colorbar(pts, ax=ax, label='Gamma')

    # ax[0].get_legend().remove()
    # ax[1].get_legend().remove()
    # ax[2].get_legend().remove()
    ax.set_yscale('log')
    ax.set_title(title)
    plt.tight_layout()
    plt.show()

    fig.savefig(f"{save_path}trialv1_{function}_{hsic_method}.png")

    return None

Figure II - MI vs Gamma vs HSIC (Linear Function)¶

plot_res_gamma(results_df, function='line', hsic_method='hsic')
plot_res_gamma(results_df, function='line', hsic_method='tka')
plot_res_gamma(results_df, function='line', hsic_method='ctka')

Observation I: The scaling

The scales for HSIC method is different than the other two methods. Both KTA and cKTA are bounded at 1 whereas HSIC is only bounded at 0. Or at least from what we can see from the graphs given. It's a slightly different interpretation in the sense that c/KTA will model if they are the same whereas HSIC models if they are

Observation II: The shape of the parameter space

We can see in the figures above that each of the HSIC estimators have different overall shapes. The HSIC has a minimum of 0 always but the others have a minimum of 1. KTA has a maximum of 1 always but cKTA's maximum changes as more noise is inputed into the system. I would say that the shapes for the maximum values for HSIC and the minimum values of c/KTA are similar with respect to the amount of noise but HSIC is inverted than the others.

Observation III: Reasonable values for minimum noise

The c/KTA methods have values close to 1 for almost all gamma values until the MI goes to about 3. After that, the values spread out more. HSIC has a large range of values that can be found but it seems that only with noise do the range of values decrease.

Case II - Sine¶

plot_res_gamma(results_df, function='sine', hsic_method='hsic')
plot_res_gamma(results_df, function='sine', hsic_method='tka')
plot_res_gamma(results_df, function='sine', hsic_method='ctka')

This is a bit trickier. The HSIC seems to be similar. But the other two functions change.

It appears that there is a maximum for the HSIC and the cKTA but not for the KTA. Yeet I think the max of the cKTA corresponds to the same values as the KTA.

Case III - Circle¶

plot_res_gamma(results_df, function='circ', hsic_method='hsic')
plot_res_gamma(results_df, function='circ', hsic_method='tka')
plot_res_gamma(results_df, function='circ', hsic_method='ctka')

This is the trickiest case. The HSIC still has a maximum value. But again, the correspondence to meaning doesn't work with the HSIC value scaling.

The CTKA and the KTA both don't seem to have a maximum for higher values of MI.

Case IV - Random¶

plot_res_gamma(results_df, function='rand', hsic_method='hsic')
plot_res_gamma(results_df, function='rand', hsic_method='tka')
plot_res_gamma(results_df, function='rand', hsic_method='ctka')

Completely random makes sense. However, the KTA is troubling because the range spans from 0.3 to 1.0 and there is no 0 included...

Both the HSIC and the cKTA's range of gamma values seem reasonable.