# Losses

## KL-Divergence vs Negative Log-Likelihood

Here we want to show that minimizing the KL-Divergence between the true distribution \(p_\text{data}(\mathbf{x})\) and the estimated distribution \(p_\theta(\mathbf{x})\) is equivalent to maximizing the likelihood of the data under our estimated distribution \(p_\theta(\mathbf{x})\).

\[ \begin{equation} \text{D}_\text{KL}\left[p_\text{data}(\mathbf{x}) || p_\theta(\mathbf{x}) \right] = -\mathbb{E}_{p_\text{data}(\mathbf{x})} \left[ \log p_\theta(\mathbf{x}) \right] + \text{constant} \end{equation} \]
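To see why, expand the definition of the KL-Divergence; the term that does not involve \(\theta\) is the (negative) entropy of the data distribution, which is constant with respect to the model parameters:

\[ \begin{aligned} \text{D}_\text{KL}\left[p_\text{data}(\mathbf{x}) || p_\theta(\mathbf{x}) \right] &= \mathbb{E}_{p_\text{data}(\mathbf{x})} \left[ \log \frac{p_\text{data}(\mathbf{x})}{p_\theta(\mathbf{x})} \right] \\ &= -\mathbb{E}_{p_\text{data}(\mathbf{x})} \left[ \log p_\theta(\mathbf{x}) \right] + \underbrace{\mathbb{E}_{p_\text{data}(\mathbf{x})} \left[ \log p_\text{data}(\mathbf{x}) \right]}_{\text{constant w.r.t. } \theta} \end{aligned} \]

So minimizing the KL-Divergence in \(\theta\) is the same as maximizing the expected log-likelihood \(\mathbb{E}_{p_\text{data}(\mathbf{x})}\left[\log p_\theta(\mathbf{x})\right]\), which in practice is approximated by the average log-likelihood over the training samples.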

## Constructive-Destructive KL-Divergence

Let \(\boldsymbol{f}_\theta\) be the invertible (bijective) normalizing function which maps \(\mathbf{x}\) to \(\mathbf{z}\), i.e. \(\boldsymbol{f}_\theta:\mathbf{x} \in \mathbb{R}^D \rightarrow \mathbf{z} \in \mathbb{R}^D\). Let \(\boldsymbol{g}_\theta\) be the inverse of \(\boldsymbol{f}_\theta\), the generating function which maps \(\mathbf{z}\) to \(\mathbf{x}\), i.e. \(\boldsymbol{g}_\theta := \boldsymbol{f}_\theta^{-1} :\mathbf{z} \in \mathbb{R}^D \rightarrow \mathbf{x} \in \mathbb{R}^D\). We can view \(\boldsymbol{f}_\theta\) as a destructive transformation whereby we “destroy” the density of the original dataset \(p_\text{data}(\mathbf{x})\) into a common base density \(p_\mathbf{z}\). Conversely, we can view \(\boldsymbol{g}_\theta\) as a constructive transformation whereby we “construct” the density of the original dataset \(p_\text{data}(\mathbf{x})\) from the base density \(p_\mathbf{z}\).

\[ \begin{equation} \mathbf{z} = \boldsymbol{f}_\theta(\mathbf{x}), \qquad \mathbf{x} = \boldsymbol{g}_\theta(\mathbf{z}) \end{equation} \]
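As a minimal sketch of this pair of maps, consider a 1D affine transform standing in for a learned flow; the names `f_theta`, `g_theta`, `mu`, and `sigma` below are illustrative choices, not fixed by the text:

```python
import numpy as np

# Toy "theta": parameters of a 1D affine flow (a stand-in for a learned f_theta).
mu, sigma = 2.0, 1.5

def f_theta(x):
    """Normalizing (destructive) direction: map data x to the base variable z."""
    return (x - mu) / sigma

def g_theta(z):
    """Generating (constructive) direction: map base z back to data x (inverse of f_theta)."""
    return mu + sigma * z

x = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=5)  # fake "data"
z = f_theta(x)
assert np.allclose(g_theta(z), x)  # g_theta is exactly f_theta^{-1}
```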

We assume \(\mathbf{z}\sim p_\mathbf{z}(\mathbf{z})\). Using the change of variables formula, we can express the model density \(p_\theta(\mathbf{x})\) in terms of \(\mathbf{z}\) and the transform \(\boldsymbol{f}_\theta\).

\[ \begin{equation} p_\theta(\mathbf{x}) = p_\mathbf{z}(\boldsymbol{f}_\theta(\mathbf{x})) \left| \det \nabla_\mathbf{x} \boldsymbol{f}_\theta(\mathbf{x})\right| \end{equation} \]
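As a quick numerical check (a sketch that reuses the toy affine flow above with a standard-normal base \(p_\mathbf{z}\)), evaluating \(p_\mathbf{z}(\boldsymbol{f}_\theta(\mathbf{x}))\) times the absolute Jacobian determinant recovers the closed-form density of the induced model, which for an affine map is simply a Gaussian:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 2.0, 1.5                       # toy affine flow: f_theta(x) = (x - mu) / sigma
f_theta = lambda x: (x - mu) / sigma
jac_det = 1.0 / sigma                      # determinant of the (1x1) Jacobian of f_theta

x = np.linspace(-3.0, 7.0, 11)
p_model = norm.pdf(f_theta(x)) * np.abs(jac_det)   # p_z(f_theta(x)) |det grad_x f_theta(x)|
p_exact = norm.pdf(x, loc=mu, scale=sigma)         # closed form for this particular flow

assert np.allclose(p_model, p_exact)
```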

This function \(\boldsymbol{f}_\theta\) “normalizes” the complex density of \(\mathbf{x}\) into the simpler base distribution of \(\mathbf{z}\). We can also express this equation in terms of \(\boldsymbol{g}_\theta\), which is the standard form found in the normalizing flows literature.

\[ \begin{equation} p_\theta(\mathbf{x}) = p_\mathbf{z}(\mathbf{z}) \left| \det \nabla_\mathbf{z} \boldsymbol{g}_\theta(\mathbf{z})\right|^{-1} \end{equation} \]

The function \(\boldsymbol{g}_\theta\) pushes forward the base density of \(\mathbf{z}\) to the more complex density of \(\mathbf{x}\).
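Continuing the same sketch, the generative-direction formula gives the identical density: evaluate the Jacobian of \(\boldsymbol{g}_\theta\) at \(\mathbf{z} = \boldsymbol{f}_\theta(\mathbf{x})\) and take the inverse of its absolute determinant:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 2.0, 1.5                        # same toy affine flow as above
f_theta = lambda x: (x - mu) / sigma
jac_det_g = sigma                           # determinant of the (1x1) Jacobian of g_theta(z) = mu + sigma * z

x = np.linspace(-3.0, 7.0, 11)
z = f_theta(x)
p_model = norm.pdf(z) * np.abs(jac_det_g) ** -1    # p_z(z) |det grad_z g_theta(z)|^{-1}

assert np.allclose(p_model, norm.pdf(x, loc=mu, scale=sigma))
```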

In this demonstration, we want to show that the following two quantities are equivalent.

\[ \begin{equation} \text{D}_\text{KL}\left[p_\text{data}(\mathbf{x}) || p_\theta(\mathbf{x}) \right] = \text{D}_\text{KL} \left[ p_\text{target}(\mathbf{z}; \theta) || p_\mathbf{z}(\mathbf{z}) \right] \end{equation} \]

This says that the KL-Divergence between the data distribution \(p_\text{data}(\mathbf{x})\) and the model \(p_\theta(\mathbf{x})\) is equivalent to the KL-Divergence between the *induced* distribution \(p_\text{target}(\mathbf{z};\theta)\), obtained by pushing \(p_\text{data}(\mathbf{x})\) through the transformation \(\boldsymbol{f}_\theta\), and the chosen base distribution \(p_\mathbf{z}(\mathbf{z})\).
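A Monte Carlo sanity check of this equivalence, again with illustrative choices (a Gaussian \(p_\text{data}\), the toy affine flow, and a standard-normal base \(p_\mathbf{z}\)) so that the pushforward \(p_\text{target}\) is available in closed form:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative choices: Gaussian data, affine flow, standard-normal base.
p_data = norm(loc=1.0, scale=2.0)             # true data distribution
mu, sigma = 0.0, 1.5                          # flow parameters "theta"
f_theta = lambda x: (x - mu) / sigma          # normalizing direction
p_z = norm(loc=0.0, scale=1.0)                # base distribution

x = rng.normal(loc=1.0, scale=2.0, size=200_000)
z = f_theta(x)

# Left-hand side: KL[p_data || p_theta], with p_theta via change of variables.
log_p_theta = p_z.logpdf(z) - np.log(sigma)
kl_x = np.mean(p_data.logpdf(x) - log_p_theta)

# Right-hand side: KL[p_target || p_z], where p_target is the pushforward of
# p_data through f_theta (another Gaussian for this affine flow).
p_target = norm(loc=(1.0 - mu) / sigma, scale=2.0 / sigma)
kl_z = np.mean(p_target.logpdf(z) - p_z.logpdf(z))

assert np.isclose(kl_x, kl_z)   # both estimate the same divergence (about 0.32 here)
```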