
Kernels and Information Measures

This post is based on the following papers:

  1. Measures of Entropy from Data Using Infinitely Divisible Kernels - Giraldo et al. (2014)
  2. Multivariate Extension of Matrix-based Rényi's \alpha-order Entropy Functional - Yu et al. (2018)

Short Overview

The following information-theoretic (IT) measures can be computed with the matrix-based scheme from these papers:

  1. Entropy
  2. Joint Entropy
  3. Conditional Entropy
  4. Mutual Information
  5. Total Correlation

Kernel Matrices

Given N samples x_1, \ldots, x_N \in \mathbb{R}^d, the kernel density estimate with a Gaussian kernel of bandwidth \sigma is

\begin{aligned} \hat{f}(x) &= \frac{1}{N} \sum_{i=1}^N K_\sigma (x, x_i) \\ K_\sigma(x, x_i) &= \frac{1}{(2\pi \sigma^2)^{d/2}}\exp\left( - \frac{||x-x_i||^2_2}{2\sigma^2} \right) \end{aligned}
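As a concrete illustration, here is a minimal sketch of building the N \times N Gram matrix that the later estimators operate on. It assumes NumPy; the helper name gaussian_kernel_matrix is mine, not from the papers.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """N x N Gaussian Gram matrix K[i, j] = K_sigma(x_i, x_j) for samples X of shape (N, d)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    d = X.shape[1]
    norm = (2.0 * np.pi * sigma**2) ** (d / 2.0)   # Gaussian normalization constant
    return np.exp(-sq_dists / (2.0 * sigma**2)) / norm

# Kernel density estimate at the sample points: the row means of the Gram matrix.
X = np.random.randn(100, 2)
K = gaussian_kernel_matrix(X, sigma=1.0)
f_hat = K.mean(axis=1)
```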

Entropy

\alpha=1

In this case, the matrix-based Rényi entropy is evaluated through the eigenvalue decomposition of the (trace-normalized) kernel matrix; in the \alpha \to 1 limit it reduces to the von Neumann entropy of that matrix. Recall the Rényi \alpha-order entropy and its \alpha \to 1 limit (the Shannon entropy):

\begin{aligned} H_\alpha(X) &= \frac{1}{1-\alpha} \log \int_\mathcal{X} f^\alpha(x) \, dx \\ H_1(X) &= \lim_{\alpha \to 1} H_\alpha(X) = -\int_\mathcal{X} f(x) \log f(x) \, dx \end{aligned}

In the matrix-based formulation of Giraldo et al., with A a kernel matrix normalized so that \text{tr}(A) = 1 and \lambda_i(A) its eigenvalues,

\begin{aligned} S_\alpha(A) &= \frac{1}{1-\alpha} \log_2 \left[ \sum_{i=1}^N \lambda_i(A)^\alpha \right] \\ S_1(A) &= \lim_{\alpha \to 1} S_\alpha(A) = -\sum_{i=1}^N \lambda_i(A) \log_2 \lambda_i(A) \end{aligned}
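A minimal sketch of this computation, assuming NumPy (the function name renyi_entropy and the small-eigenvalue cutoff eps are my choices, not from the papers):

```python
import numpy as np

def renyi_entropy(K, alpha=1.0, eps=1e-12):
    """Matrix-based Renyi alpha-entropy S_alpha of a symmetric PSD kernel matrix K."""
    A = K / np.trace(K)                    # normalize so that tr(A) = 1
    lam = np.linalg.eigvalsh(A)            # eigenvalues of the normalized matrix
    lam = lam[lam > eps]                   # drop numerically zero eigenvalues
    if np.isclose(alpha, 1.0):             # alpha -> 1 limit: von Neumann entropy
        return -np.sum(lam * np.log2(lam))
    return np.log2(np.sum(lam**alpha)) / (1.0 - alpha)
```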

\alpha=2

In this case, the entropy can be estimated directly from the Kernel Density Estimation procedure, without an eigendecomposition.

H_2(X) = -\log \int_\mathcal{X} f^2(x) \, dx

Substituting the Parzen estimate \hat{f} for f gives the plug-in estimator

\hat{H}_2(X) = - \log \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^N K_{\sqrt{2}\sigma}(x_i, x_j)

Note: this step uses the convolution theorem for Gaussian functions, i.e. the convolution of two Gaussians of width \sigma is a Gaussian of width \sqrt{2}\sigma.

Practically

We can compute the estimator above by pre- and post-multiplying the kernel matrix \mathbf{K}_x by the all-ones vector \mathbf{1}_N:

\hat{H}_2(X) = - \log \frac{1}{N^2} \mathbf{1}_N^\top \mathbf{K}_x \mathbf{1}_N

where \mathbf{1}_N \in \mathbb{R}^{N} is the vector of ones. The quantity \frac{1}{N^2} \mathbf{1}_N^\top \mathbf{K}_x \mathbf{1}_N is known as the information potential, V.
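A short sketch of this estimator, reusing the gaussian_kernel_matrix helper sketched earlier (so the same assumptions apply); note the \sqrt{2}\sigma bandwidth from the convolution step:

```python
import numpy as np

def renyi2_entropy(X, sigma):
    """H_2 estimate via the information potential V = (1/N^2) * 1^T K 1."""
    N = X.shape[0]
    K = gaussian_kernel_matrix(X, np.sqrt(2.0) * sigma)   # assumes the helper sketched above
    ones = np.ones(N)
    V = ones @ K @ ones / N**2                             # information potential
    return -np.log(V)
```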


Cross Information Potential RKHS

\hat{H}(X) = -\log\left(\frac{1}{n^2} \text{tr}(KK) \right) + C(\sigma)

  • The quantity \frac{1}{n^2}\text{tr}(KK) plays the role of the cross information potential \mathcal{V} in the RKHS formulation; C(\sigma) is a constant that depends only on the kernel bandwidth.
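A sketch of this variant under the same assumptions as above; since C(\sigma) does not depend on the data, it is omitted here:

```python
import numpy as np

def renyi2_entropy_gram(X, sigma):
    """Second-order entropy up to the constant C(sigma), computed from tr(KK)."""
    n = X.shape[0]
    K = gaussian_kernel_matrix(X, sigma)        # assumes the helper sketched above
    return -np.log(np.trace(K @ K) / n**2)
```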

Joint Entropy

This formula builds on the entropy formulation above. To incorporate both r.v.'s X and Y, we construct a kernel matrix for each: A for X and B for Y.

S_\alpha(A,B) = S_\alpha \left( \frac{A \circ B}{\text{tr}(A \circ B)} \right)

Note:
  • The trace is there for normalization.
  • The matrices A and B must be the same size (due to the elementwise Hadamard product), i.e. built from the same number of samples.
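A minimal sketch, reusing the renyi_entropy helper from above (which already performs the trace normalization); the function name is mine:

```python
def joint_renyi_entropy(A, B, alpha=1.0):
    """Matrix-based joint entropy S_alpha(A, B) via the Hadamard product."""
    H = A * B                        # elementwise (Hadamard) product of the Gram matrices
    return renyi_entropy(H, alpha)   # trace normalization happens inside renyi_entropy
```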

Multivariate

This extends to multiple variables. Let's say we have L variables with kernel matrices A_1, \ldots, A_L; then we can calculate the joint entropy like so:

S_\alpha(A_1, A_2, \ldots, A_L) = S_\alpha \left( \frac{A_1 \circ A_2 \circ \ldots \circ A_L}{\text{tr}(A_1 \circ A_2 \circ \ldots \circ A_L)} \right)
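The same computation for a list of Gram matrices, under the same assumptions as the sketches above:

```python
from functools import reduce

def multivariate_joint_entropy(mats, alpha=1.0):
    """S_alpha of the Hadamard product A_1 ∘ A_2 ∘ ... ∘ A_L."""
    H = reduce(lambda P, Q: P * Q, mats)   # elementwise product of all Gram matrices
    return renyi_entropy(H, alpha)         # assumes renyi_entropy sketched above
```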

Conditional Entropy

This mirrors the traditional formula for conditional entropy: the joint entropy of the r.v.'s X, Y minus the entropy of Y, i.e. H(X|Y) = H(X,Y) - H(Y). Assume the kernel matrix for r.v. X is A and the kernel matrix for r.v. Y is B. The following formula shows how this is calculated with kernel matrices.

S_\alpha(A|B) = S_\alpha \left( \frac{A \circ B}{\text{tr}(A \circ B)} \right) - S_\alpha(B)
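As a sketch, reusing the helpers above:

```python
def conditional_renyi_entropy(A, B, alpha=1.0):
    """S_alpha(A|B) = S_alpha(A, B) - S_alpha(B)."""
    return joint_renyi_entropy(A, B, alpha) - renyi_entropy(B, alpha)
```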

Mutual Information

The classic Shannon definition is the sum of the marginal entropies minus the joint entropy of the r.v.'s X, Y, i.e. MI(X;Y) = H(X) + H(Y) - H(X,Y). The following formula shows the MI with kernel matrices:

I_\alpha(A;B) = S_\alpha(A) + S_\alpha(B) - S_\alpha \left( \frac{A \circ B}{\text{tr}(A \circ B)} \right)

The definition is exactly the same and uses the entropy and joint-entropy formulations above.
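A corresponding sketch, again reusing the helpers above:

```python
def renyi_mutual_information(A, B, alpha=1.0):
    """I_alpha(A; B) = S_alpha(A) + S_alpha(B) - S_alpha(A, B)."""
    return (renyi_entropy(A, alpha)
            + renyi_entropy(B, alpha)
            - joint_renyi_entropy(A, B, alpha))
```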

Multivariate

This can be extended to multiple variables. Using the same multivariate setup as above, assume B corresponds to a single variable while A is multivariate with kernel matrices A_1, A_2, \ldots, A_L (so S_\alpha(A) denotes the multivariate joint entropy from the previous section). We can write the MI as:

I_\alpha(A;B) = S_\alpha(A) + S_\alpha(B) - S_\alpha \left( \frac{A_1 \circ A_2 \circ \ldots \circ A_L \circ B}{\text{tr}(A_1 \circ A_2 \circ \ldots \circ A_L \circ B)} \right)
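A sketch of the multivariate case, assuming A_list is a Python list of Gram matrices and reusing the helpers above:

```python
def multivariate_mutual_information(A_list, B, alpha=1.0):
    """I_alpha({A_1, ..., A_L}; B) using the multivariate joint entropy."""
    return (multivariate_joint_entropy(A_list, alpha)
            + renyi_entropy(B, alpha)
            - multivariate_joint_entropy(list(A_list) + [B], alpha))
```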

Total Correlation

This is a measure of redundancy for multivariate data. It is the sum of the entropies of the marginals minus the joint entropy of the multivariate distribution. Let's assume A is multivariate with kernel matrices A_1, A_2, \ldots, A_L. Then we can write the total correlation using kernel matrices as:

T_\alpha(\mathbf{A}) = S_\alpha(A_1) + S_\alpha(A_2) + \ldots + S_\alpha(A_L) - S_\alpha \left( \frac{A_1 \circ A_2 \circ \ldots \circ A_L}{\text{tr}(A_1 \circ A_2 \circ \ldots \circ A_L)} \right)
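Finally, a sketch of the total correlation, reusing the helpers above:

```python
def total_correlation(A_list, alpha=1.0):
    """T_alpha = sum of marginal entropies minus the multivariate joint entropy."""
    marginals = sum(renyi_entropy(A, alpha) for A in A_list)
    return marginals - multivariate_joint_entropy(A_list, alpha)
```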