
Information Theory Measures using the ITE Toolbox

  • Author: J. Emmanuel Johnson
  • Email: jemanjohnson34@gmail.com
  • Date: 4th September, 2019

This notebook walks through how to calculate a few key information theory (IT) measures using the ITE toolbox. We have done previous experiments with the MATLAB package, but there is a Python version that can be useful for Python users. It is a lot cleaner, but some of the functionality may be difficult to follow.

Resources

  • Gael Implementation - Gist
  • ITE sub-implementations - GitHub

Literature Review (what we previously did)

Entropy

In our experiments, we were only looking at Shannon entropy. It is the limiting case of Rényi's entropy as \alpha \rightarrow 1. We chose not to look at Rényi's entropy because we did not want to go down a rabbit hole of measures that we cannot understand nor justify. So we stuck to the basics. It's also important to keep in mind that we were looking at measures that can calculate the joint entropy, i.e. for multivariate, multi-dimensional datasets.
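
For reference, Rényi's entropy of order \alpha is

H_\alpha(X) = \frac{1}{1 - \alpha} \log \int p^\alpha(x) \, dx

and in the limit \alpha \rightarrow 1 it recovers the Shannon differential entropy,

H(X) = - \int p(x) \log p(x) \, dx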

Algorithms

KnnK

This uses the k-nearest-neighbours (KNN) method to estimate the entropy. From what I understand, it's the simplest method, and it may have some issues at higher dimensions and with large numbers of samples (normal for KNN estimators). Compared to the other standard density-estimation methods, however, it is the most robust in higher dimensions due to its adaptive-like binning.

  • A new class of random vector entropy estimators and its applications in testing statistical hypotheses - Goria et al. (2005) - Paper
  • Nearest neighbor estimates of entropy - Singh et al. (2003) - Paper
  • A statistical estimate for the entropy of a random vector - Kozachenko & Leonenko (1987) - Paper
KDP

This is the logical progression from KnnK. It uses a k-d partitioning tree (KDTree) scheme, presumably to speed up the calculations (see the sketch after the reference below).

  • Fast multidimensional entropy estimation by k-d partitioning - Stowell & Plumbley (2009) - Paper
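
As a rough illustration of where the speed-up comes from (this is not the Stowell & Plumbley partitioning scheme itself, just the analogous tree-based neighbour search available in scikit-learn), we can swap a brute-force neighbour search for a k-d tree:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_demo = rng.randn(5000, 3)

# brute force: computes all pairwise distances
nn_brute = NearestNeighbors(n_neighbors=10, algorithm='brute').fit(X_demo)

# k-d tree: recursively partitions the space; typically much faster
# in low-to-moderate dimensions
nn_kdtree = NearestNeighbors(n_neighbors=10, algorithm='kd_tree').fit(X_demo)

# both return the same (distances, indices); only the search strategy differs
dists_brute, _ = nn_brute.kneighbors(X_demo)
dists_tree, _ = nn_kdtree.kneighbors(X_demo)
assert np.allclose(dists_brute, dists_tree)
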
expF

This is the closed-form expression for the Sharma-Mittal entropy of exponential families. It fits the exponential-family parameters to the data by maximum likelihood and then applies the analytical formula.

  • A closed-form expression for the Sharma-Mittal entropy of exponential families - Nielsen & Nock (2012) - Paper
vME

This estimates the Shannon differential entropy (H) using the von Mises expansion.

  • Nonparametric von Mises estimators for entropies, divergences and mutual informations - Kandasamy et al. (2015) - Paper
Ensemble

Estimates the entropy as the average of entropy estimates computed on groups (batches) of samples.

This is a simple implementation with the freedom to choose the estimator estimate_H.

# split into groups and average the entropy estimates
H = 0
for igroup in batches:
    H += estimate_H(igroup)

H /= len(batches)
  • High-dimensional mutual information estimation for image registration - Kybic (2004) - Paper

Potential New Experiments

Voronoi

Estimates Shannon entropy using Voronoi regions. Apparently it is good for multi-dimensional densities.

  • A new class of entropy estimators for multi-dimensional densities - Miller (2003) - Paper

Mutual Information

Total Correlation

Code

import sys, os
cwd = os.getcwd()
sys.path.insert(0, f'{cwd}/../../src')
sys.path.insert(0, f'{cwd}/../../src/itetoolbox')

import numpy as np
import ite
# from data.load_TishbyData import load_TishbyData

%matplotlib inline
%load_ext autoreload
%autoreload 2

Data

We will simulate some data X that is normally distributed and Y, which is X transformed by some random matrix A.

np.random.seed(123)    # reproducibility
n_samples    = 1000
d_dimensions = 3

# create dataset X
X = np.random.randn(n_samples, d_dimensions)

# do some random rotation
A = np.random.rand(d_dimensions, d_dimensions)

# create dataset Y
Y = X @ A

Entropy

In our experiments, we were only looking at Shannon entropy. It is the limiting case of Rényi's entropy as \alpha \rightarrow 1. We chose not to look at Rényi's entropy because we did not want to go down a rabbit hole of measures that we cannot understand nor justify. So we stuck to the basics. It's also important to keep in mind that we were looking at measures that can calculate the joint entropy, i.e. for multivariate, multi-dimensional datasets.

Algorithms

KnnK

This uses the k-nearest-neighbours (KNN) method to estimate the entropy. From what I understand, it's the simplest method, and it may have some issues at higher dimensions and with large numbers of samples (normal for KNN estimators).

  • A new class of random vector entropy estimators and its applications in testing statistical hypotheses - Goria et al. (2005) - Paper
  • Nearest neighbor estimates of entropy - Singh et al. (2003) - Paper
  • A statistical estimate for the entropy of a random vector - Kozachenko & Leonenko (1987) - Paper

This method works by computing the k-nearest-neighbour distances of each sample and plugging them into the formula below.

KDP

This is the logical progression from KnnK. It uses KD partitioning trees (KDTree) algorithm to speed up the calculations I presume.

  • Fast multidimensional entropy estimation by k-d partitioning - Stowell & Plumbley (2009) - Paper

Algorithm

  1. Calculate the k-nearest-neighbour distances for each sample
  2. Calculate the volume of the unit ball w.r.t. d_dimensions
  3. Calculate the entropy measure

H = \log (N - 1) - \psi(k) + \log (v_d) + \frac{d}{N} \sum_{i=1}^{N} \log \epsilon_i(k)

where N is the number of samples, k the number of neighbours, \psi the digamma function, v_d the volume of the d-dimensional unit ball, and \epsilon_i(k) the distance from sample i to its k-th nearest neighbour.

def entropy_gaussian(C):
    '''
    Entropy of a Gaussian variable with covariance matrix C
    (analytical closed form, used as a reference value).
    '''
    if np.isscalar(C):  # C is the variance
        return .5 * (1 + np.log(2 * np.pi)) + .5 * np.log(C)
    else:
        n = C.shape[0]  # dimension
        return .5 * n * (1 + np.log(2 * np.pi)) + .5 * np.log(abs(np.linalg.det(C)))
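
As a quick sanity check on the estimators below, we can compare them against this analytical value computed from the sample covariance (a minimal usage sketch, assuming the X and Y arrays generated above):

# analytical entropy of a Gaussian with the sample covariance
H_x_true = entropy_gaussian(np.cov(X.T))
H_y_true = entropy_gaussian(np.cov(Y.T))

print(f"H(X) analytical: {H_x_true:.3f} nats")
print(f"H(Y) analytical: {H_y_true:.3f} nats")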

Shannon Entropy (KNN/KDP)

from scipy.special import gamma, psi
from sklearn.neighbors import NearestNeighbors
from typing import Optional

# volume of unit ball
def volume_unit_ball(d_dimensions: int)-> float:
    """Volume of the d-dimensional unit ball

    Parameters
    ----------
    d_dimensions : int
        Number of dimensions to estimate the volume

    Returns
    -------
    vol : float
        The volume of the d-dimensional unit ball
    """
    return ( np.pi**(.5 * d_dimensions) ) / gamma(.5 * d_dimensions + 1)


# KNN Distances
def knn_distance(X: np.ndarray, n_neighbors: int=20, algorithm: str='brute', n_jobs: int=-1, kwargs: Optional[dict]=None)-> np.ndarray:
    """Light wrapper around sklearn library.

    Parameters
    ----------
    X : np.ndarray, (n_samples x d_dimensions)
        The data to find the nearest neighbors for.

    n_neighbors : int, default=20
        The number of nearest neighbors to find.

    algorithm : str, default='brute', 
        The knn algorithm to use.
        ('brute', 'ball_tree', 'kd_tree', 'auto')

    n_jobs : int, default=-1
        The number of cores to use to find the nearest neighbors

    kwargs : dict, Optional
        Any extra keyword arguments.

    Returns
    -------
    distances : np.ndarray, (n_samples x n_neighbors)
    """
    if kwargs:
        clf_knn = NearestNeighbors(
            n_neighbors=n_neighbors,
            algorithm=algorithm,
            n_jobs=n_jobs,
            **kwargs
        )
    else:
        clf_knn = NearestNeighbors(
            n_neighbors=n_neighbors,
            algorithm=algorithm,
            n_jobs=n_jobs,
        )

    clf_knn.fit(X);

    dists, _ = clf_knn.kneighbors(X)

    return dists
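
Before wiring these helpers into an estimator, a quick check that they behave as expected (the unit-ball volumes below are the known closed-form values \pi and \frac{4}{3}\pi):

# d=2: pi * r^2 with r=1;  d=3: (4/3) * pi * r^3 with r=1
print(volume_unit_ball(2))    # ~3.14159
print(volume_unit_ball(3))    # ~4.18879

# distances from each sample in X to its 10 nearest neighbours
dists = knn_distance(X, n_neighbors=10)
print(dists.shape)            # (n_samples, 10)
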
from sklearn.base import BaseEstimator
from sklearn.utils import gen_batches

class Ensemble:
    def __init__(self):
        pass
    def _fit_ensemble(self, X: np.ndarray, batch_size: int=100)-> float:

        Hs = list()
        for idx in gen_batches(X.shape[0], batch_size, min_batch_size=10):
            Hs.append(self._fit(X[idx]))


        return np.mean(Hs)

class EntropyKNN(BaseEstimator, Ensemble):
    def __init__(self, n_neighbors: int=20, algorithm: str='brute', n_jobs: int=-1, ensemble=False, batch_size=100, kwargs: Optional[dict]=None)-> None:
        self.n_neighbors = n_neighbors
        self.algorithm = algorithm
        self.n_jobs = n_jobs
        self.ensemble = ensemble
        self.kwargs = kwargs
        self.batch_size = batch_size

    def fit(self, X: np.ndarray)-> BaseEstimator:

        self.vol = volume_unit_ball(X.shape[1])

        if self.ensemble:
            self.H_x = self._fit_ensemble(X, self.batch_size)
        else:
            self.H_x = self._fit(X)

        return self

    def _fit(self, X: np.ndarray)-> float:

        n_samples, d_dimensions = X.shape
        k = self.n_neighbors

        # 1. Calculate the K-nearest neighbors
        dist = knn_distance(
            X,
            n_neighbors=k,
            algorithm=self.algorithm,
            n_jobs=self.n_jobs,
            kwargs=self.kwargs
        )

        # 2. Plug the k-th neighbour distances into the KnnK entropy formula
        return (
            np.log(n_samples - 1)
            - psi(k)
            + np.log(self.vol)
            + (d_dimensions / n_samples) * np.log(dist[:, k - 1]).sum()
        )

    def score(self, X):

        return self.H_x
# parameters (default)
n_neighbors = 20
algorithm = 'brute'
n_jobs = -1
ensemble = False
batch_size = 50
kwargs = {'metric': 'euclidean'}

# initialize it estimator
clf_knnK = EntropyKNN(
    n_neighbors=n_neighbors,
    algorithm=algorithm,
    n_jobs=n_jobs,
    ensemble=ensemble,
    batch_size=batch_size,
    kwargs=kwargs,

)

# estimate entropy
H_x = clf_knnK.fit(X).score(X)
H_y = clf_knnK.fit(Y).score(Y)

print(f"H(X): {H_x:.3f} bits")
print(f"H(Y): {H_y:.3f} bits")
H(X): 4.077 bits
H(Y): 2.131 bits

Notice that there are quite a lot of parameters we can change within the actual KNN estimation procedure, but the rest of the estimator is fairly fixed, with little else to tweak.

ITE Toolbox implementation

# parameters (default)
mult        = True
knn_method  = 'cKDTree'      # fast version (the slower alternative is 'KNN')
k_neighbors = 10             # free parameter
eps         = 0.1            # free parameter

# initialize it estimator
clf_knnK = ite.cost.BHShannon_KnnK(
    mult=mult, 
    knn_method=knn_method,
    k=k_neighbors,
    eps=eps
)

# estimate entropy
H_x = clf_knnK.estimation(X)
H_y = clf_knnK.estimation(Y)

print(f"H(X): {H_x:.3f} bits")
print(f"H(Y): {H_y:.3f} bits")
H(X): 4.132 bits
H(Y): 2.208 bits

It seems like the numbers we get are quite similar.

Shannon Entropy: expF

This is the closed-form expression for the Sharma-Mittal entropy of exponential families. The Sharma-Mittal entropy is a generalization of the Shannon, Rényi and Tsallis entropy measures. It fits the exponential-family parameters to the data by maximum likelihood and then applies the analytical formula.

  • A closed-form expression for the Sharma-Mittal entropy of exponential families - Nielsen & Nock (2012) - Paper
  • Statistical exponential families: A digest with flash cards - Nielsen & Garcia (2009) - Paper

Source Parameters

\Lambda = (\mu, \Sigma)

where \mu \in \mathbb{R}^{d} and \Sigma > 0

Natural Parameters

\Theta = \left( \Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1} \right)

Log Normalizer

F(\Theta) = \frac{1}{4} Tr( \theta^\top \Theta^{-1} \theta) - \frac{1}{2} \log|\Theta| + \frac{d}{2}\log \pi

Gradient Log Normalizer

\nabla F(\Theta) = \left( \frac{1}{2} \Theta^{-1}\theta, \; -\frac{1}{2} \Theta^{-1} - \frac{1}{4}(\Theta^{-1}\theta)(\Theta^{-1}\theta)^\top \right)

Final Entropy Calculation

H = F(\Theta) - \langle \Theta, \nabla F(\Theta) \rangle

n_samples, n_dims = X.shape

# source params, theta
theta_1 = X.mean(axis=0)
theta_2 = np.cov(X.T)
print('Source:', theta_1.shape, theta_2.shape)
# natural params, eta
eta_1 = np.linalg.inv(theta_2) @ theta_1[:, None]
eta_2 = .5 * np.linalg.inv(theta_2)
print('Natural:', eta_1.shape, eta_2.shape)

# log-normalizer, F(eta)
f_eta = .25 * np.trace(eta_1.T @ np.linalg.inv(eta_2) @ eta_1) - .5 * np.linalg.slogdet(eta_2)[1] + (n_dims / 2.) * np.log(np.pi) 
print('Log Norm:', f_eta.shape)

# gradient log normalizer, dF(eta)
df_eta_1 = .5 * np.linalg.inv(eta_2) @ eta_1
df_eta_2 = -.5 * np.linalg.inv(eta_2) - .25 * (np.linalg.inv(eta_2) @ eta_1) @ (np.linalg.inv(eta_2) @ eta_1).T
print('Grad Log Norm:', df_eta_1.shape, df_eta_2.shape)

# inner product <eta, dF(eta)>: vector part plus matrix (trace) part
inner = float(eta_1.T @ df_eta_1) + np.trace(eta_2.T @ df_eta_2)

# final entropy, H = F(eta) - <eta, dF(eta)>
H_expF = f_eta - inner
print('Entropy:', H_expF)
from scipy import stats

def expF_entropy(X):

    # estimate Gaussian parameters (maximum likelihood)
    mean = X.mean(axis=0)
    cov = np.cov(X.T)

    # make a Gaussian distribution with the fitted parameters
    norm_dist = stats.multivariate_normal(mean=mean, cov=cov)

    # estimate the entropy from the closed-form solution
    H_x = norm_dist.entropy()

    return H_x
H_x = expF_entropy(X)
H_y = expF_entropy(Y)

print(f"H(X): {H_x:.3f} bits")
print(f"H(Y): {H_y:.3f} bits")
H(X): 4.195 bits
H(Y): 1.329 bits

As you can see, this approach works well if the distribution actually is Gaussian, but not if it isn't.
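
A quick way to see this is to feed both estimators some hypothetical non-Gaussian samples (a sketch using the expF_entropy and EntropyKNN helpers defined above; X_exp is an illustrative variable name):

# heavy-tailed, non-Gaussian samples
X_exp = np.random.exponential(scale=1.0, size=(n_samples, d_dimensions))

# parametric estimate (assumes Gaussianity) vs non-parametric KNN estimate
H_expf = expF_entropy(X_exp)
H_knnk = EntropyKNN(n_neighbors=20).fit(X_exp).score(X_exp)

print(f"expF (Gaussian assumption): {H_expf:.3f} nats")
print(f"KnnK (non-parametric)     : {H_knnk:.3f} nats")

The two estimates should diverge here, whereas they roughly agreed on the Gaussian X above.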

mean = X.mean(axis=0)
cov = np.cov(X.T)
seed = 1

norm_dist = stats.multivariate_normal(mean=mean, cov=cov, seed=seed)

H_x = norm_dist.entropy()

print(f"H(X): {H_x:.3f} bits")
H(X): 4.195 bits
# 1. estimate the maximum likelihood params (natural parametrization)
mean = X.mean(axis=0)[:, None]
cov = np.cov(X.T)
inv_cov = np.linalg.inv(cov)
alpha = inv_cov @ mean     # vector natural param, Sigma^-1 mu
sigma = inv_cov / 2        # matrix natural param, (1/2) Sigma^-1

# 2. Log Normalizer, F(Theta)
F = (1/4) * np.trace(alpha.T @ np.linalg.inv(sigma) @ alpha) \
    - (1/2) * np.linalg.slogdet(sigma)[1] \
    + (X.shape[1] / 2) * np.log(np.pi)

# 3. Gradient Log Normalizer, dF(Theta)
alpha_grad = 0.5 * np.linalg.inv(sigma) @ alpha
sigma_grad = -0.5 * np.linalg.inv(sigma) \
    - 0.25 * (np.linalg.inv(sigma) @ alpha) @ (np.linalg.inv(sigma) @ alpha).T

# 4. Entropy, H = F(Theta) - <Theta, dF(Theta)>
H = F - float(alpha.T @ alpha_grad) - np.trace(sigma.T @ sigma_grad)
print(f"H(X): {H:.3f} nats")

vME

This is a nonparametric method that estimates the Shannon differential entropy (H) using the von Mises expansion. It has a faster convergence rate than the KDE- and KNN-based methods, and its tuning parameters can be chosen via cross-validation. It is also less expensive than KDE-based estimators: instead of numerical integration, it has closed-form solutions for some families of von Mises expansions.

  • Nonparametric von Mises estimators for entropies, divergences and mutual informations - Kandasamy et al. (2015) - Paper
Ensemble

Estimates the entropy as the average of entropy estimates computed on groups (batches) of samples.

This is a simple implementation with the freedom to choose the estimator estimate_H; a usage sketch with the EntropyKNN class defined earlier follows the reference below.

# split into groups and average the entropy estimates
H = 0
for igroup in batches:
    H += estimate_H(igroup)

H /= len(batches)
  • High-dimensional mutual information estimation for image registration - Kybic (2004) - Paper
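
With the EntropyKNN class defined earlier, the ensemble variant is just a flag (a usage sketch; the batch size is a free parameter):

# same KnnK estimator, but averaging over batches of 100 samples
clf_ens = EntropyKNN(n_neighbors=20, ensemble=True, batch_size=100)

H_x_ens = clf_ens.fit(X).score(X)
print(f"H(X), ensemble estimate: {H_x_ens:.3f} nats")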

Potential New Experiments

Voronoi

Estimates Shannon entropy using Voronoi regions. Apparently it is good for multi-dimensional densities.

  • A new class of entropy estimators for multi-dimensional densities - Miller (2003) - Paper

Mutual Information

The estimation was carried out using the following relationship. Let XY = [X, Y] \in \mathbb{R}^{N \times D}, where D = D_1 + D_2.

I(X,Y) = H(X) + H(Y) - H(XY)

The pseudo-code is fairly simple (in the MATLAB version).

  1. Organize the components
XY = [X, Y]
  2. Estimate the joint entropy, H(XY)
H_xy = estimate_H(
    np.hstack(XY)     # stack the vectors dimension-wise
)
  3. Estimate the marginal entropies; i.e. estimate H(X) and H(Y) individually, then sum them.

    H_x_y = np.sum(
        # estimate the entropy for each marginal
        [estimate_H(imarginal) for imarginal in XY]
    )

  4. Take the difference of the two quantities

MI_XY = H_x_y - H_xy

Naive

class Ensemble:
    def _fit_ensemble(self, X: np.ndarray, vol: float, batch_size: int=100)-> float:

        Hs = list()
        for idx in gen_batches(X.shape[0], batch_size, min_batch_size=10):
            Hs.append(self._fit(X[idx], vol))


        return np.mean(Hs)

class MutualInfoKNN(BaseEstimator):
    def __init__(self, n_neighbors: int=20, algorithm: str='brute', n_jobs: int=-1, kwargs: Optional[dict]=None)-> None:
        self.n_neighbors = n_neighbors
        self.algorithm = algorithm
        self.n_jobs = n_jobs
        self.kwargs = kwargs

    def fit(self, X: np.ndarray, Y: np.ndarray)-> BaseEstimator:

        # Calculate Volumes
        vol_xy = volume_unit_ball(X.shape[1] + Y.shape[1])
        vol_x = volume_unit_ball(X.shape[1])
        vol_y = volume_unit_ball(Y.shape[1])

        # Calculate Joint Entropy (stack dimension-wise, not sample-wise)
        H_xy = self._fit(np.hstack([X, Y]), vol_xy)

        # Calculate Marginal Entropies
        H_x = self._fit(X, vol_x)
        H_y = self._fit(Y, vol_y)

        # Calculate Mutual Information
        self.MI = H_x + H_y - H_xy

        return self

    def _fit(self, X: np.ndarray, vol: float)-> float:

        n_samples, d_dimensions = X.shape
        k = self.n_neighbors

        # 1. Calculate the K-nearest neighbors
        dist = knn_distance(
            X,
            n_neighbors=k,
            algorithm=self.algorithm,
            n_jobs=self.n_jobs,
            kwargs=self.kwargs
        )

        # 2. Plug the k-th neighbour distances into the KnnK entropy formula
        return (
            np.log(n_samples - 1)
            - psi(k)
            + np.log(vol)
            + (d_dimensions / n_samples) * np.log(dist[:, k - 1]).sum()
        )

    def score(self, X):

        return self.MI
# parameters (default)
n_neighbors = 10
algorithm = 'brute'
n_jobs = -1
ensemble = False
batch_size = 50
kwargs = {'metric': 'euclidean'}

# initialize it estimator
clf_knnK = MutualInfoKNN(
    n_neighbors=n_neighbors,
    algorithm=algorithm,
    n_jobs=n_jobs,
    kwargs=kwargs,

)

# estimate entropy
MI_xy = clf_knnK.fit(X, Y).score(X)

print(f"H(X): {MI_xy:.3f} bits")
H(X): 6.427 bits

ITE Toolbox

# parameters (default)
mult       = True          # is the multiplicative constant relevant
kl_co_name = 'BDKL_KnnK'   # KLD calculator
kl_co_pars = None          # parameters for the KLD calculator

# initialize it estimator
clf_mi = ite.cost.MIShannon_DKL(
#     mult=mult,
#     kl_co_name=kl_co_name,
#     kl_co_pars=kl_co_pars,
)

# concat data
XY = np.concatenate((X, Y), axis=1)

# individual dimensions per
sub_dimensions = np.array([X.shape[1], Y.shape[1]])

# estimate mutual information
mi_XY = clf_mi.estimation(XY, sub_dimensions)

print(f"MI(X,Y): {mi_XY:.3f} bits")
MI(X,Y): 3.683 bits
# parameters (default)
mult   = True                          # is the multiplicative constant relevant
kernel = {'name': 'RBF','sigma': 1}    # kernel used by the KGV estimator
eta    = 0.01                          # free parameter of the KGV estimator

# initialize it estimator
clf_mi = ite.cost.BIKGV(
#     mult=mult,
#     kernel=kernel,
#     eta=eta,
)

# concat data
XY = np.concatenate((X, Y), axis=1)

# individual dimensions per
sub_dimensions = np.array([X.shape[1], Y.shape[1]])

# estimate mutual information
mi_XY = clf_mi.estimation(XY, sub_dimensions)

print(f"MI(X,Y): {mi_XY:.3f} bits")
MI(X,Y): 2.062 bits

I expect there to be some mutual information between X and Y, since Y is a deterministic linear transformation of the original data.


Total Correlation (Multi-Information, Co-Information)

The estimation was carried out using the following relationship:

I(x_1, x_2, \ldots, x_D) = \sum_{d=1}^D H(x_d) - H(X)
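
For jointly Gaussian data there is a convenient closed form to compare against: the total correlation reduces to -\frac{1}{2}\log|R|, where R is the correlation matrix. A quick sketch under that Gaussian assumption:

# Gaussian closed form: TC = -0.5 * log det(correlation matrix)
R_x = np.corrcoef(X.T)
R_y = np.corrcoef(Y.T)

tc_x_gauss = -0.5 * np.linalg.slogdet(R_x)[1]
tc_y_gauss = -0.5 * np.linalg.slogdet(R_y)[1]

print(f"Gaussian TC(X): {tc_x_gauss:.3f} nats")
print(f"Gaussian TC(Y): {tc_y_gauss:.3f} nats")

For X (independent dimensions) this should be close to zero, and clearly positive for Y.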

# parameters (default)
mult       = True
kl_co_name = 'BDKL_KnnK'
kl_co_pars = None

# initialize it estimator
clf_mi = ite.cost.MIShannon_DKL(
    mult=mult,
    kl_co_name=kl_co_name,
    kl_co_pars=kl_co_pars,
)

# subspace dimensions: each dimension is treated as its own subspace
sub_dimensions = np.ones(X.shape[1], dtype=int)

# estimate mutual information
tc_X = clf_mi.estimation(X, sub_dimensions)
tc_Y = clf_mi.estimation(Y, sub_dimensions)

print(f"Shannon Total Correlation, TC(X): {tc_X:.3f} bits")
print(f"Shannon Total Correlation, TC(Y): {tc_Y:.3f} bits")
Shannon Total Correlation, TC(X): -0.002 bits
Shannon Total Correlation, TC(Y): 2.365 bits

This makes sense given that the original data X should have no correlations between dimensions, because its coordinates are independent Gaussians. Transforming X by the random matrix A, Y = XA, introduces correlations between dimensions, and we see this as a higher TC.