Maximum Mean Discrepancy (MMD)#
The Maximum Mean Discrepancy (MMD) is a distance measure between distributions, defined as the distance between their feature means (kernel mean embeddings).
Idea#
This is done by taking the within-dataset similarity of each of the datasets individually and comparing it with the cross-dataset similarity between the two.
Formulation#
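Following the standard formulation, the (squared) MMD is the RKHS distance between the mean embeddings \(\mu_P = \mathbb{E}_{x \sim P}[\varphi(x)]\) and \(\mu_Q = \mathbb{E}_{y \sim Q}[\varphi(y)]\), which reduces to expectations of the kernel (the final line of the proof below):

\[\text{MMD}^2(P,Q) = \|\mu_P - \mu_Q\|_\mathcal{F}^2 = \mathbb{E}_{x,x' \sim P}\left[k(x,x')\right] + \mathbb{E}_{y,y' \sim Q}\left[k(y,y')\right] - 2\, \mathbb{E}_{x \sim P,\, y \sim Q}\left[k(x,y)\right]\]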
Proof
\[\begin{split}\begin{aligned} \|\mu_P - \mu_Q\|_\mathcal{F}^2 &= \langle \mu_P - \mu_Q, \mu_P - \mu_Q \rangle_\mathcal{F} \\ &= \langle \mu_P, \mu_P \rangle_\mathcal{F} + \langle \mu_Q, \mu_Q \rangle_\mathcal{F} - 2 \langle \mu_P, \mu_Q \rangle_\mathcal{F} \\ &= \mathbb{E}_{x \sim P} \langle \mu_P, \varphi(x) \rangle_\mathcal{F} + \mathbb{E}_{y \sim Q} \langle \mu_Q, \varphi(y) \rangle_\mathcal{F} - 2\, \mathbb{E}_{x \sim P} \langle \mu_Q, \varphi(x) \rangle_\mathcal{F} \\ &= \mathbb{E}_{x \sim P} \langle \mu_P, k(x, \cdot) \rangle_\mathcal{F} + \mathbb{E}_{y \sim Q} \langle \mu_Q, k(y, \cdot) \rangle_\mathcal{F} - 2\, \mathbb{E}_{x \sim P} \langle \mu_Q, k(x, \cdot) \rangle_\mathcal{F} \\ &= \mathbb{E}_{x \sim P} \left[ \mu_P(x) \right] + \mathbb{E}_{y \sim Q} \left[ \mu_Q(y) \right] - 2\, \mathbb{E}_{x \sim P} \left[ \mu_Q(x) \right] \\ &= \mathbb{E}_{x, x' \sim P} \left[ k(x,x') \right] + \mathbb{E}_{y, y' \sim Q} \left[ k(y,y') \right] - 2\, \mathbb{E}_{x \sim P,\, y \sim Q} \left[ k(x,y) \right] \end{aligned}\end{split}\]
Kernel Trick#
Let \(k(x,y) = \langle \varphi(x), \varphi(y) \rangle_\mathcal{H}\); then the MMD can be computed purely from kernel evaluations, without ever forming the explicit feature map \(\varphi\).
Source: Stackoverflow
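A minimal sketch of how the kernel matrices used in the estimates below might be built with the kernel trick, assuming an RBF (Gaussian) kernel on toy data; the helper rbf_kernel_matrix, the bandwidth, and the sample sizes are illustrative choices, not part of the original:

import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    # Pairwise squared Euclidean distances between rows of A and rows of B
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2 * A @ B.T
    )
    # RBF kernel: k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))
    return np.exp(-sq_dists / (2 * sigma**2))

# Toy samples from two distributions P and Q
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 2))   # x ~ P
Y = rng.normal(0.5, 1.0, size=(80, 2))    # y ~ Q
m, n = len(X), len(Y)

# Kernel matrices used by the MMD estimates below
Kxx = rbf_kernel_matrix(X, X)
Kyy = rbf_kernel_matrix(Y, Y)
Kxy = rbf_kernel_matrix(X, Y)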
Empirical Estimate#
Unbiased Estimate#
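The unbiased (U-statistic) estimate implemented in the code below drops the diagonal (\(i = j\)) terms of the within-sample kernel sums:

\[\widehat{\text{MMD}}_u^2 = \frac{1}{m(m-1)}\sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)}\sum_{i \neq j} k(y_i, y_j) - \frac{2}{mn}\sum_{i=1}^m \sum_{j=1}^n k(x_i, y_j)\]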
Code
import numpy as np

# Kxx, Kyy, Kxy: precomputed kernel matrices (e.g. as sketched above); m, n: sample sizes

# Term I: within-sample similarity for X (diagonal removed for unbiasedness)
c1 = 1 / (m * (m - 1))
A = np.sum(Kxx - np.diag(np.diagonal(Kxx)))

# Term II: within-sample similarity for Y (diagonal removed for unbiasedness)
c2 = 1 / (n * (n - 1))
B = np.sum(Kyy - np.diag(np.diagonal(Kyy)))

# Term III: cross-sample similarity
c3 = 1 / (m * n)
C = np.sum(Kxy)

# estimate MMD^2 (unbiased)
mmd_est = c1 * A + c2 * B - 2 * c3 * C
Sources
Biased Estimate#
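The biased (V-statistic) estimate keeps the diagonal terms and uses plain averages of the kernel matrices, which is what the one-liner in the resources section (Kx.mean() + Ky.mean() - 2 * Kxy.mean()) computes:

\[\widehat{\text{MMD}}_b^2 = \frac{1}{m^2}\sum_{i,j} k(x_i, x_j) + \frac{1}{n^2}\sum_{i,j} k(y_i, y_j) - \frac{2}{mn}\sum_{i,j} k(x_i, y_j)\]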
Equivalence#
Euclidean Distance#
Let’s assume that \(\mathbf{x,y}\) come from two distributions, so \(\mathbf{x} \sim \mathbb{P}\) and \(\mathbf{y} \sim \mathbb{Q}\). We can write the MMD as the norm of the difference between the means in feature space.
Empirical Estimation
This interpretation holds only for Gaussian kernels, but we can empirically estimate it as:

\[\widehat{\text{MMD}}^2 = \frac{1}{m^2}\sum_{i=1}^m\sum_{j=1}^m G_\sigma(\mathbf{x}_i - \mathbf{x}_j) + \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n G_\sigma(\mathbf{y}_i - \mathbf{y}_j) - \frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^n G_\sigma(\mathbf{x}_i - \mathbf{y}_j)\]

where G is the Gaussian kernel with a standard deviation of \(\sigma\) (a numerical sketch follows below).
Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives - Principe
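A self-contained numerical sketch of this estimate, assuming an unnormalized Gaussian (RBF) kernel; the helper name gaussian_gram and the toy data are illustrative:

import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    # Pairwise Gaussian kernel values G_sigma(a - b) (unnormalized RBF form)
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 1))   # x ~ P
Y = rng.normal(1.0, 1.0, size=(200, 1))   # y ~ Q

# Biased estimate: average within-sample similarities minus twice the cross-sample term
mmd2 = (
    gaussian_gram(X, X).mean()
    + gaussian_gram(Y, Y).mean()
    - 2 * gaussian_gram(X, Y).mean()
)
print(mmd2)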
KL-Divergence#
This has an alternative interpretation that parallels the Kullback-Leibler Divergence (KLD). Recall that, in this setting, the MMD is the distance between the joint distribution \(P=\mathbb{P}_{x,y}\) and the product of the marginals \(Q=\mathbb{P}_x\mathbb{P}_y\).
The KLD has an analogous interpretation in terms of the mutual information: it measures the difference between the joint distribution \(P(x,y)\) and the product of the marginal distributions \(p_x p_y\).
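In symbols, the mutual information is the KLD between the joint distribution and the product of the marginals:

\[I(X,Y) = D_{\text{KL}}\left(\mathbb{P}_{x,y} \,\|\, \mathbb{P}_x \mathbb{P}_y\right)\]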
Variation of Information#
In information theory, we have a measure called the variation of information (aka the shared information distance), which is a simple linear expression involving the mutual information. Unlike the mutual information itself, however, it is a valid distance metric that obeys the triangle inequality.

\[\text{VI}(X,Y) = H(X) + H(Y) - 2I(X,Y)\]

where \(H(X)\) is the entropy of \(\mathcal{X}\) and \(I(X,Y)\) is the mutual information between \(\mathcal{X,Y}\).
Properties
\(\text{VI}(X,Y) \geq 0\)
\(\text{VI}(X,Y) = 0 \implies X=Y\)
\(\text{VI}(X,Y) = \text{VI}(Y,X)\)
\(\text{VI}(X,Z) \leq \text{VI}(X,Y) + \text{VI}(Y,Z)\)
HSIC#
Similar to the KLD interpretation, this formulation is equivalent to the Hilbert-Schmidt Independence Criterion (HSIC): if we take the MMD distance between the joint distribution and the product of the marginals, we get exactly the HSIC measure. This rests on a couple of kernel equivalences, shown in the proof below.
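As a sketch of the identity being referenced here (standard in the HSIC literature), with \(C_{xy}\) denoting the cross-covariance operator:

\[\text{HSIC}(X,Y) = \text{MMD}^2\left(\mathbb{P}_{x,y},\, \mathbb{P}_x \mathbb{P}_y\right) = \|C_{xy}\|_{\text{HS}}^2\]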
Proof#
First, we need a couple of equivalences. The first is that the inner product of the joint feature maps \(\varphi(\cdot, \cdot)\) is the same as the kernel on the product space.
The second is that the kernel on the product space of \(\mathcal{X,Y}\) is equal to the product of the respective kernels for \(\mathcal{X}\) and \(\mathcal{Y}\). So, let’s say we have a kernel \(k\) on the dataset \(\mathcal{X}\) with feature space \(\mathcal{F}\), and a kernel \(l\) on the dataset \(\mathcal{Y}\) with feature space \(\mathcal{G}\). Then the kernel on \(\mathcal{(X,Y)}\) pairs is the product \(k(x,x')\,l(y,y')\).
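In symbols, assuming the joint feature map is the tensor product \(\varphi(x,y) = \varphi(x) \otimes \varphi(y)\), the two equivalences read:

\[\langle \varphi(x,y), \varphi(x',y') \rangle_{\mathcal{F} \otimes \mathcal{G}} = k\big((x,y), (x',y')\big) = k(x,x')\, l(y,y')\]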
Resources#
Similarity Loss (TF2) - code
Biased Estimate
Code
mmd = Kx.mean() + Ky.mean() - 2 * Kxy.mean()
Unbiased Estimate