Maximum Mean Discrepancy (MMD)#

The Maximum Mean Discrepancy (MMD) is a distance between distributions, measured as the distance between their feature means (kernel mean embeddings).

Idea#

It is computed from three terms: the average within-dataset similarity of each dataset on its own, minus twice the average cross-dataset similarity between the two datasets.


Formulation#

\[\begin{split}\begin{aligned} \text{MMD}^2(P,Q) &= ||\mu_P - \mu_Q||_\mathcal{F}^2 \\ &= \mathbb{E}_{x,x' \sim P}\left[ k(x,x')\right] + \mathbb{E}_{y,y' \sim Q}\left[ k(y,y')\right] - 2\, \mathbb{E}_{x \sim P,\, y \sim Q}\left[ k(x,y)\right] \end{aligned}\end{split}\]
  • Proof

    \[\begin{split}\begin{aligned} ||\mu_P - \mu_Q||_\mathcal{F}^2 &= \langle \mu_P - \mu_Q, \mu_P - \mu_Q \rangle_\mathcal{F} \\ &= \langle \mu_P, \mu_P \rangle_\mathcal{F} + \langle \mu_Q, \mu_Q \rangle_\mathcal{F} - 2 \langle \mu_P,\mu_Q \rangle_\mathcal{F} \\ &= \mathbb{E}_{x \sim P} \left[ \mu_P(x) \right] + \mathbb{E}_{y \sim Q} \left[ \mu_Q(y) \right] - 2\, \mathbb{E}_{x \sim P} \left[ \mu_Q(x) \right] \\ &= \mathbb{E}_{x \sim P} \langle \mu_P, \varphi(x) \rangle_\mathcal{F} + \mathbb{E}_{y \sim Q} \langle \mu_Q, \varphi(y) \rangle_\mathcal{F} - 2\, \mathbb{E}_{x \sim P} \langle \mu_Q, \varphi(x) \rangle_\mathcal{F} \\ &= \mathbb{E}_{x \sim P} \langle \mu_P, k(x, \cdot) \rangle_\mathcal{F} + \mathbb{E}_{y \sim Q} \langle \mu_Q, k(y, \cdot) \rangle_\mathcal{F} - 2\, \mathbb{E}_{x \sim P} \langle \mu_Q, k(x, \cdot) \rangle_\mathcal{F} \\ &= \mathbb{E}_{x,x' \sim P} \left[ k(x,x') \right] + \mathbb{E}_{y,y' \sim Q} \left[ k(y,y') \right] - 2\, \mathbb{E}_{x \sim P,\, y \sim Q} \left[ k(x,y) \right] \end{aligned}\end{split}\]

Kernel Trick#

Let \(k(x,y) = \langle \varphi(x), \varphi(y) \rangle_\mathcal{H}\):

\[\begin{split} \begin{aligned} \text{MMD}^2(P, Q) &= || \mathbb{E}_{X \sim P}\, \varphi(X) - \mathbb{E}_{Y \sim Q}\, \varphi(Y) ||^2_\mathcal{H} \\ &= \langle \mathbb{E}_{X \sim P}\, \varphi(X), \mathbb{E}_{X' \sim P}\, \varphi(X')\rangle_\mathcal{H} + \langle \mathbb{E}_{Y \sim Q}\, \varphi(Y), \mathbb{E}_{Y' \sim Q}\, \varphi(Y')\rangle_\mathcal{H} - 2 \langle \mathbb{E}_{X \sim P}\, \varphi(X), \mathbb{E}_{Y' \sim Q}\, \varphi(Y')\rangle_\mathcal{H} \end{aligned} \end{split}\]

Source: Stackoverflow
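
A quick numerical check of this identity, using an explicit finite-dimensional feature map so both sides can be computed directly. The quadratic kernel and toy data below are illustrative assumptions, not part of the original derivation:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))           # samples from P
    Y = rng.normal(loc=1.0, size=(120, 2))  # samples from Q

    def phi(Z):
        # explicit feature map of the polynomial kernel k(a, b) = (a @ b)^2
        z1, z2 = Z[:, 0], Z[:, 1]
        return np.stack([z1**2, z2**2, np.sqrt(2) * z1 * z2], axis=1)

    def k(A, B):
        # the same kernel evaluated without the feature map
        return (A @ B.T) ** 2

    # LHS: squared norm of the difference between mean feature vectors
    lhs = np.sum((phi(X).mean(axis=0) - phi(Y).mean(axis=0)) ** 2)

    # RHS: kernel averages (biased / V-statistic form)
    rhs = k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

    print(np.isclose(lhs, rhs))  # True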


Empirical Estimate#

Unbiased Estimate#

\[\begin{aligned} \widehat{\text{MMD}}^2 &= \frac{1}{m(m-1)} \sum_{i\neq j}^m k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i\neq j}^n k(y_i, y_j) - \frac{2}{mn} \sum_{i=1}^m \sum_{j=1}^n k(x_i, y_j) \end{aligned}\]
  • Code

    import numpy as np

    # Kxx (m x m), Kyy (n x n), Kxy (m x n) are precomputed kernel (Gram) matrices
    m, n = Kxx.shape[0], Kyy.shape[0]

    # Term I: within-sample similarity of X (diagonal removed for unbiasedness)
    c1 = 1 / (m * (m - 1))
    A = np.sum(Kxx - np.diag(np.diagonal(Kxx)))

    # Term II: within-sample similarity of Y (diagonal removed)
    c2 = 1 / (n * (n - 1))
    B = np.sum(Kyy - np.diag(np.diagonal(Kyy)))

    # Term III: cross-sample similarity between X and Y
    c3 = 1 / (m * n)
    C = np.sum(Kxy)

    # unbiased estimate of MMD^2
    mmd_est = c1 * A + c2 * B - 2 * c3 * C
    
    

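For reference, the kernel matrices used above might be built with an RBF kernel; a sketch assuming scikit-learn and two sample arrays X (m × d) and Y (n × d):

    from sklearn.metrics.pairwise import rbf_kernel

    # Gram matrices within and across the two samples
    Kxx = rbf_kernel(X, X)  # (m, m)
    Kyy = rbf_kernel(Y, Y)  # (n, n)
    Kxy = rbf_kernel(X, Y)  # (m, n)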

Biased Estimate#
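
The biased (plug-in, or V-statistic) version keeps the diagonal terms and simply averages the full kernel matrices. With \(m\) samples from \(P\) and \(n\) samples from \(Q\):

\[\begin{aligned} \widehat{\text{MMD}}_b^2 &= \frac{1}{m^2} \sum_{i,j=1}^m k(x_i, x_j) + \frac{1}{n^2} \sum_{i,j=1}^n k(y_i, y_j) - \frac{2}{mn} \sum_{i=1}^m \sum_{j=1}^n k(x_i, y_j) \end{aligned}\]

In code this is just the mean of each kernel matrix; see the one-liner at the end of these notes.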


Equivalence#

Euclidean Distance#

Let’s assume that \(\mathbf{x,y}\) come from two distributions, so \(\mathbf{x} \sim \mathbb{P}\) and \(\mathbf{y} \sim \mathbb{Q}\). We can write the MMD as the norm of the difference between the means in feature space.

\[\text{D}_{ED}(\mathbb{P,Q}) = ||\mu_\mathbf{x} - \mu_\mathbf{y}||^2_\mathcal{F} = ||\mu_\mathbf{x}||^2_\mathcal{F} + ||\mu_\mathbf{y}||^2_\mathcal{F} - 2 \langle \mu_\mathbf{x}, \mu_\mathbf{y}\rangle_\mathcal{F}\]

Empirical Estimation

This equivalence holds for Gaussian kernels. We can estimate it empirically as:

\[\begin{split}\text{D}_{ED}(\mathbb{P,Q}) = \frac{1}{N_x^2} \sum_{i=1}^{N_x}\sum_{j=1}^{N_x} \text{G}(\mathbf{x}_i, \mathbf{x}_j) \\ + \frac{1}{N_y^2} \sum_{i=1}^{N_y}\sum_{j=1}^{N_y} \text{G}(\mathbf{y}_i, \mathbf{y}_j) \\ - \frac{2}{N_x N_y} \sum_{i=1}^{N_x}\sum_{j=1}^{N_y} \text{G}(\mathbf{x}_i, \mathbf{y}_j)\end{split}\]

where G is the Gaussian kernel with a standard deviation of \(\sigma\).

  • Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives - Principe
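
A minimal sketch of this estimator, assuming a Gaussian kernel with bandwidth \(\sigma\) and two sample arrays X and Y (the function names are illustrative):

    import numpy as np
    from scipy.spatial.distance import cdist

    def gaussian_kernel(A, B, sigma=1.0):
        # G(a, b) = exp(-||a - b||^2 / (2 * sigma^2))
        return np.exp(-cdist(A, B, "sqeuclidean") / (2 * sigma**2))

    def euclidean_distance_pdf(X, Y, sigma=1.0):
        # D_ED(P, Q): within-X term + within-Y term - 2 * cross term
        Gxx = gaussian_kernel(X, X, sigma).mean()
        Gyy = gaussian_kernel(Y, Y, sigma).mean()
        Gxy = gaussian_kernel(X, Y, sigma).mean()
        return Gxx + Gyy - 2 * Gxy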


KL-Divergence#

This has an alternative interpretation that is similar to the Kullback-Leibler divergence. If we take the MMD between the joint distribution \(P=\mathbb{P}_{x,y}\) and the product of the marginals \(Q=\mathbb{P}_x\mathbb{P}_y\), we obtain a measure of dependence.

\[\text{MMD}(P_{XY}, P_X P_Y; \mathcal{H}_k) = || \mu_{P_{XY}} - \mu_{P_X P_Y}||\]

This mirrors the KL divergence, which defines the mutual information as the divergence between the joint distribution \(P(x,y)\) and the product of the marginal distributions \(p_x p_y\).

\[I(X,Y) = D_{KL} \left[ P(x,y) || p_x p_y \right]\]
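
Written out for discrete variables, this reads:

\[I(X,Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{p_x\, p_y}\]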

Variation of Information#

In information theory, the variation of information (aka the shared information distance) is a simple linear expression involving the mutual information. Unlike mutual information itself, it is a valid distance metric that obeys the triangle inequality.

\[\text{VI}(X,Y) = H(X) + H(Y) - 2 I (X,Y)\]

where \(H(X)\) is the entropy of \(\mathcal{X}\) and \(I(X,Y)\) is the mutual information between \(\mathcal{X,Y}\).

Properties

  • \(\text{VI}(X,Y) \geq 0\)

  • \(\text{VI}(X,Y) = 0 \implies X=Y\)

  • \(\text{VI}(X,Y) = \text{VI}(Y,X)\) (symmetry)

  • \(\text{VI}(X,Z) \leq \text{VI}(X,Y) + \text{VI}(Y,Z)\) (triangle inequality)
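
A minimal sketch for discrete variables, assuming integer-coded samples and the entropy / mutual information helpers from scipy and scikit-learn (the function name is illustrative):

    import numpy as np
    from scipy.stats import entropy
    from sklearn.metrics import mutual_info_score

    def variation_of_information(x, y):
        # H(X), H(Y) from the empirical label frequencies (in nats)
        hx = entropy(np.bincount(x))
        hy = entropy(np.bincount(y))
        # I(X, Y) from the empirical joint distribution (in nats)
        mi = mutual_info_score(x, y)
        # VI(X, Y) = H(X) + H(Y) - 2 I(X, Y)
        return hx + hy - 2 * mi

    # toy integer-coded samples
    x = np.array([0, 0, 1, 1, 2, 2])
    y = np.array([0, 1, 1, 2, 2, 0])
    print(variation_of_information(x, y))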


HSIC#

Similar to the KLD interpretation, this formulation is equivalent to the Hilbert-Schmidt Independence Criterion (HSIC): if we take the MMD distance between the joint distribution and the product of the marginals, we get the HSIC measure.

\[ \begin{aligned} \text{MMD}^2(P_{XY}, P_XP_Y; \mathcal{H}_k) &= ||\mu_{P_{XY}} - \mu_{P_XP_Y}||^2 \end{aligned} \]

which is the exact formulation for HSIC.

\[ \begin{aligned} \text{MMD}^2(P_{XY}, P_XP_Y; \mathcal{H}_k) &= \text{HSIC}(P_{XY}; \mathcal{F}, \mathcal{G}) \end{aligned} \]

given a couple of equivalences between kernels and feature maps, which we establish below.

Proof#

We need two equivalences. First, the inner product between joint feature maps \(\varphi(\cdot, \cdot)\) is the kernel on the product space:

\[ \begin{aligned} \langle \varphi(x,y), \varphi(x',y') \rangle_\mathcal{F} &= k \left((x,y),(x',y')\right) \end{aligned} \]

Second, the kernel on the product space factorizes into the product of the individual kernels. Say we have a kernel \(k\) on \(\mathcal{X}\) with feature space \(\mathcal{F}\) and a kernel \(k\) on \(\mathcal{Y}\) with feature space \(\mathcal{G}\); the kernel on the \((x,y)\) pairs is then their product:

\[\begin{split} \begin{aligned} k\left((x,y),(x',y')\right) &= k(x,x')\,k(y,y') \end{aligned} \end{split}\]
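
A minimal sketch of the standard biased empirical HSIC, \(\widehat{\text{HSIC}} = \tfrac{1}{n^2}\operatorname{tr}(KHLH)\), which realizes this MMD between the joint sample and the product of the marginals. It assumes scikit-learn's rbf_kernel and paired samples X, Y; note that conventions differ between \(1/n^2\) and \(1/(n-1)^2\) normalization:

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    def hsic_biased(X, Y, gamma=1.0):
        # kernel matrices on the X features and the Y features
        n = X.shape[0]
        K = rbf_kernel(X, X, gamma=gamma)
        L = rbf_kernel(Y, Y, gamma=gamma)
        # centering matrix H = I - (1/n) 1 1^T
        H = np.eye(n) - np.ones((n, n)) / n
        # biased HSIC estimate: (1/n^2) tr(K H L H)
        return np.trace(K @ H @ L @ H) / n**2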

Resources#

\[\text{MMD}^2(\mathcal{F}, p, q) = \mathbb{E}_{\mathbf{x,x}'}\left[k(\mathbf{x,x'})\right] + \mathbb{E}_{\mathbf{y,y}'}\left[k(\mathbf{y,y'})\right] - 2\, \mathbb{E}_{\mathbf{x,y}}\left[k(\mathbf{x,y})\right]\]

Biased Estimate

  • Code

    # biased (V-statistic) estimate from precomputed kernel matrices Kx, Ky, Kxy
    mmd = Kx.mean() + Ky.mean() - 2 * Kxy.mean()
    

Unbiased Estimate
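
For the unbiased version, drop the diagonal (self-similarity) terms before averaging; a sketch assuming the same precomputed matrices Kx (m × m), Ky (n × n), and Kxy (m × n):

  • Code

    import numpy as np

    m, n = Kx.shape[0], Ky.shape[0]
    # remove the diagonal so each within-sample term is an unbiased U-statistic
    mmd = (Kx.sum() - np.trace(Kx)) / (m * (m - 1)) \
        + (Ky.sum() - np.trace(Ky)) / (n * (n - 1)) \
        - 2 * Kxy.mean()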