Information Theory Measures¶
Entropy (Shannon)¶
One Random Variable¶
If we have a discrete random variable X with p.m.f. p_x(x), the entropy is:
H(X) = -\sum_{x} p_x(x) \log p_x(x)
- This measures the expected uncertainty in X.
- The entropy is basically how much information we learn on average from one instance of the r.v. X.
import numpy as np
from scipy.stats import entropy

# labels: a 1-D array of discrete outcomes; build its empirical distribution
values, counts = np.unique(labels, return_counts=True)
probs = counts / counts.sum()
H = -(probs * np.log2(probs)).sum()  # Shannon entropy in bits, by hand
H = entropy(counts, base=2)          # equivalent: scipy normalizes the counts
Two Random Variables¶
If we have two random variables X, Y jointly distributed according to the p.m.f. p(x,y), we can define two more entropic quantities.
Joint Entropy¶
This is given by:
H(X,Y) = -\sum_{x,y} p(x,y) \log p(x,y)
Definition: the total uncertainty we have about the pair of r.v.s X,Y taken together.
Conditional Entropy¶
This is given by:
H(X|Y) = -\sum_{x,y} p(x,y) \log p(x|y)
Definition: how much uncertainty remains about the r.v. X when we know the value of Y.
Properties of Entropic Quantities¶
- Non-Negativity: H(X) \geq 0, with equality if and only if X is deterministic (i.e. no randomness).
- Chain Rule: You can decompose the joint entropy measure (a small numerical check follows this list):
  H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X^{i-1}), where X^{i-1} = \{ X_1, X_2, \ldots, X_{i-1} \}. So, for two r.v.s, the result is:
  H(X,Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)
- Monotonicity: Conditioning never increases entropy; on average, information never hurts:
  H(X|Y) \leq H(X)
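As a minimal sketch of the chain rule in practice (the two small discrete samples x and y below are hypothetical), we can estimate H(Y) and H(X,Y) from counts and then recover H(X|Y) = H(X,Y) - H(Y):

import numpy as np
from scipy.stats import entropy

# hypothetical paired samples of two discrete variables
x = np.array([0, 0, 1, 1, 2, 2, 0, 1])
y = np.array([0, 1, 1, 1, 0, 0, 0, 1])

H_y = entropy(np.unique(y, return_counts=True)[1], base=2)

# joint entropy from the counts of unique (x, y) pairs
_, joint_counts = np.unique(np.stack([x, y], axis=1), axis=0, return_counts=True)
H_xy = entropy(joint_counts, base=2)

H_x_given_y = H_xy - H_y  # chain rule: H(X,Y) = H(X|Y) + H(Y)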
Negentropy¶
It is simply entropy, but we restrict the comparison to a Gaussian reference. Let's say that we have Z which comes from a normal distribution, z \sim \mathcal{N}(0, \mathbb{I}). We can write the same standard KLD formulation, but with the Gaussian as the reference distribution:
J(X) = D_\text{KL}(P_X \| P_Z) = H(Z) - H(X)
where the second equality holds when X is standardized to zero mean and identity covariance. Negentropy is non-negative and equals zero if and only if X is itself Gaussian.
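A rough sketch (my own illustration, not from the notes above): for a standardized one-dimensional sample, negentropy is often approximated with higher-order moments, e.g. J(x) \approx \frac{1}{12}\mathbb{E}[x^3]^2 + \frac{1}{48}\text{kurt}(x)^2, where kurt is the excess kurtosis:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10_000)   # a clearly non-Gaussian sample
x = (x - x.mean()) / x.std()          # standardize: zero mean, unit variance

# classical moment-based approximation of negentropy (zero for a Gaussian)
skew_term = np.mean(x**3) ** 2 / 12
kurt_term = (np.mean(x**4) - 3) ** 2 / 48   # excess kurtosis, squared
J_approx = skew_term + kurt_term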
Entropy (Renyi)¶
Above we looked at Shannon entropy, which is a special case of Rényi's entropy measure. The generalized formula, for order \alpha \geq 0, \alpha \neq 1, is:
H_\alpha(X) = \frac{1}{1-\alpha} \log \left( \sum_{x} p(x)^\alpha \right)
In the limit \alpha \to 1 we recover the Shannon entropy.
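A small sketch of this formula for a discrete distribution (the renyi_entropy helper and the example p.m.f. below are just for illustration):

import numpy as np

def renyi_entropy(p, alpha, base=2):
    """Rényi entropy of order alpha (alpha >= 0, alpha != 1) of a discrete p.m.f. p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.log(np.sum(p ** alpha)) / ((1 - alpha) * np.log(base))

p = [0.5, 0.25, 0.25]
H2 = renyi_entropy(p, alpha=2)                        # collision entropy (alpha = 2)
H_near_shannon = renyi_entropy(p, alpha=1.0001)       # approaches Shannon entropy as alpha -> 1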
Mutual Information¶
Definition: The mutual information (MI) between two discrete r.v.s X,Y jointly distributed according to p(x,y) is given by:
I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}
Sources:
- Scholarpedia
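A minimal sketch computing MI (in bits) from paired discrete samples via a contingency table (the sample arrays below are hypothetical):

import numpy as np

x = np.array([0, 0, 1, 1, 2, 2, 0, 1])
y = np.array([0, 0, 1, 1, 1, 0, 0, 1])

# joint distribution from a contingency table
table = np.zeros((x.max() + 1, y.max() + 1))
np.add.at(table, (x, y), 1)
p_xy = table / table.sum()
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)

# I(X;Y) = sum over nonzero cells of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
mask = p_xy > 0
MI = (p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])).sum()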
Total Correlation (Multi-Information)¶
In general, the formula for Total Correlation (TC) between two random variables is as follows:
\text{TC}(X,Y) = H(X) + H(Y) - H(X,Y)
Note: This is the same as the equation for mutual information between two random variables, I(X;Y) = H(X) + H(Y) - H(X,Y). This makes sense because the Venn diagram for two r.v.s has only one intersecting region. This is different in the multivariate case, where the number of r.v.s is greater than 2.
Let's have D random variables X = \{ X_1, X_2, \ldots, X_D \}. The TC is:
\text{TC}(X) = \sum_{d=1}^{D} H(X_d) - H(X_1, X_2, \ldots, X_D)
In this case, each d indexes a feature (dimension) of X.
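A sketch estimating TC for discrete features, assuming a samples-by-features array X (the total_correlation helper and the toy data are just for illustration):

import numpy as np
from scipy.stats import entropy

def total_correlation(X, base=2):
    """TC(X) = sum_d H(X_d) - H(X_1, ..., X_D) for an (n_samples, D) array of discrete features."""
    H_marginals = sum(
        entropy(np.unique(col, return_counts=True)[1], base=base) for col in X.T
    )
    _, joint_counts = np.unique(X, axis=0, return_counts=True)
    return H_marginals - entropy(joint_counts, base=base)

X = np.array([[0, 0], [0, 0], [1, 1], [1, 1], [0, 1], [1, 0]])
tc = total_correlation(X)   # 0 would mean the features are independent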
Now, let's say we would like to get the difference in total correlation between two multivariate random variables, \Delta\text{TC}:
\Delta\text{TC}(X,Y) = \text{TC}(X) - \text{TC}(Y) = \sum_{d=1}^{D} H(X_d) - H(X) - \sum_{d=1}^{D} H(Y_d) + H(Y)
Note: There is a special case in RBIG where the two random variables are simply rotations of one another. Each feature's marginal entropy can change, but the joint entropy of the overall dataset does not, since rotations preserve joint entropy. So our function reduces to \Delta\text{TC}(X,Y) = \sum_{d=1}^{D} H(X_d) - \sum_{d=1}^{D} H(Y_d), which is much easier to compute.
Cross Entropy (Log-Loss Function)¶
Let P(\cdot) be the true distribution and Q(\cdot) be the predicted distribution. We can define the cross entropy as:
H(P, Q) = -\sum_{x} P(x) \log Q(x)
This can be thought of as the expected message length (in bits, when using \log_2) for encoding samples drawn from P with a code optimized for Q.
Note: The original cross-entropy uses \log_2(\cdot), but in a supervised setting we can use another base such as \log_{10}, since by the change-of-base rule \log_2(\cdot) = \frac{\log_{10}(\cdot)}{\log_{10}(2)}; switching bases only rescales the loss by a constant factor.
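A minimal sketch of the log-loss for a single example, assuming a one-hot true label p and predicted probabilities q (both hypothetical):

import numpy as np

p = np.array([0.0, 1.0, 0.0])   # true distribution P (one-hot label)
q = np.array([0.1, 0.7, 0.2])   # predicted distribution Q
eps = 1e-12                     # guard against log(0)
cross_entropy = -(p * np.log2(q + eps)).sum()   # H(P, Q) in bits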
Kullback-Leibler Divergence (KL)¶
Furthermore, the KL divergence is the difference between the cross-entropy and the entropy:
D_\text{KL}(P \| Q) = H(P, Q) - H(P) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
So this measures how far our predicted distribution Q is from the true distribution P.
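A quick sketch using scipy, whose entropy helper returns the KL divergence when given two distributions (p and q are the same hypothetical distributions as above):

import numpy as np
from scipy.stats import entropy

p = np.array([0.0, 1.0, 0.0])     # true distribution
q = np.array([0.1, 0.7, 0.2])     # predicted distribution
kl = entropy(p, q, base=2)        # D_KL(P || Q) = H(P, Q) - H(P), in bits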
Conditional Information Theory Measures¶
Conditional Entropy¶
Conditional Mutual Information¶
Definition: Let X,Y,Z be jointly distributed according to some p.m.f. p(x,y,z). The conditional mutual information between X and Y given Z is:
I(X;Y|Z) = \sum_{x,y,z} p(x,y,z) \log \frac{p(x,y|z)}{p(x|z)\,p(y|z)} = H(X|Z) - H(X|Y,Z)
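A sketch estimating I(X;Y|Z) for discrete samples from the identity I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) (the joint_entropy helper and the sample arrays are hypothetical):

import numpy as np
from scipy.stats import entropy

def joint_entropy(*cols, base=2):
    """Joint entropy of one or more discrete columns, from counts of unique rows."""
    _, counts = np.unique(np.stack(cols, axis=1), axis=0, return_counts=True)
    return entropy(counts, base=base)

x = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
z = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
cmi = (joint_entropy(x, z) + joint_entropy(y, z)
       - joint_entropy(x, y, z) - joint_entropy(z))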