Losses#
KL-Divergence vs Negative Log-Likelihood#
Here we want to show that minimizing the KL-Divergence between the true distribution \(p_\text{data}(\mathbf{x})\) and the estimated distribution \(p_\theta(\mathbf{x})\) is equivalent to maximizing the likelihood of our estimated distribution \(p_\theta(\mathbf{x})\).
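In symbols, the claim is:

\[
\arg\min_\theta D_\text{KL}\left[p_\text{data}(\mathbf{x}) \,\|\, p_\theta(\mathbf{x})\right]
= \arg\max_\theta \mathbb{E}_{p_\text{data}(\mathbf{x})}\left[\log p_\theta(\mathbf{x})\right]
\]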
Proof
First we decompose the KL-Divergence into its log terms.
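\[
D_\text{KL}\left[p_\text{data}(\mathbf{x}) \,\|\, p_\theta(\mathbf{x})\right]
= \int p_\text{data}(\mathbf{x}) \log \frac{p_\text{data}(\mathbf{x})}{p_\theta(\mathbf{x})}\, d\mathbf{x}
= \mathbb{E}_{p_\text{data}(\mathbf{x})}\left[\log p_\text{data}(\mathbf{x})\right]
- \mathbb{E}_{p_\text{data}(\mathbf{x})}\left[\log p_\theta(\mathbf{x})\right]
\]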
The first term is the negative entropy of our data, \(-H\left(p_\text{data}(\mathbf{x}) \right)\). This term doesn't depend on our parameters \(\theta\), which means it will be constant regardless of how well we estimate \(p_\theta(\mathbf{x})\). So we can simplify the objective:
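\[
\arg\min_\theta D_\text{KL}\left[p_\text{data}(\mathbf{x}) \,\|\, p_\theta(\mathbf{x})\right]
= \arg\min_\theta \left(- \mathbb{E}_{p_\text{data}(\mathbf{x})}\left[\log p_\theta(\mathbf{x})\right]\right)
\]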
The remaining term is the cross-entropy: the expected number of bits needed to encode samples from \(p_\text{data}(\mathbf{x})\) using a code based on \(p_\theta(\mathbf{x})\). This is minimal when \(p_\text{data}(\mathbf{x}) = p_\theta(\mathbf{x})\) (Shannon's source coding theorem). Now let \(p_\text{data}(\mathbf{x})\) be the empirical distribution, described by a mixture of Dirac deltas: it places equal probability mass on each observed data point and zero everywhere else. Plugging this into our KL-Divergence function, we get:
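\[
- \mathbb{E}_{p_\text{data}(\mathbf{x})}\left[\log p_\theta(\mathbf{x})\right]
\approx - \int \frac{1}{N}\sum_{n=1}^{N} \delta(\mathbf{x} - \mathbf{x}_n)\, \log p_\theta(\mathbf{x}) \, d\mathbf{x}
\]

where \(\mathbf{x}_1, \ldots, \mathbf{x}_N\) denote the \(N\) observed samples.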
Then, using the law of large numbers, whereby given enough samples we can empirically estimate this integral, we can simplify this even further:
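\[
\arg\min_\theta D_\text{KL}\left[p_\text{data}(\mathbf{x}) \,\|\, p_\theta(\mathbf{x})\right]
\approx \arg\min_\theta \left(- \frac{1}{N}\sum_{n=1}^{N} \log p_\theta(\mathbf{x}_n)\right)
= \arg\max_\theta \frac{1}{N}\sum_{n=1}^{N} \log p_\theta(\mathbf{x}_n)
\]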
We are left with the log-likelihood term. So maximizing the likelihood of our estimated distribution \(p_\theta(\mathbf{x})\) is equivalent to minimizing the KL-Divergence between the estimated distribution \(p_\theta(\mathbf{x})\) and the real distribution \(p_\text{data}(\mathbf{x})\). This gives us a proxy objective that allows us to find the parameters \(\theta\) without explicitly knowing the real distribution.
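As a toy illustration of this proxy, here is a minimal sketch (plain NumPy/SciPy; the Gaussian model and all names are illustrative assumptions, not part of the derivation) that recovers the parameters of \(p_\theta(\mathbf{x})\) by minimizing the sample-average negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Toy sketch: fit a 1D Gaussian p_theta(x) = N(x; mu, sigma^2) to samples from an
# unknown p_data(x) by minimizing the Monte Carlo estimate of the negative
# log-likelihood, i.e. the cross-entropy term derived above.

rng = np.random.default_rng(0)
x_data = rng.normal(loc=2.0, scale=0.5, size=1_000)  # stand-in for samples from p_data

def nll(params):
    """Average negative log-likelihood of the data under N(mu, sigma^2)."""
    mu, log_sigma = params
    return -np.mean(norm.logpdf(x_data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(nll, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # ~ (2.0, 0.5): the maximum-likelihood estimate
```

Even though we never evaluate \(p_\text{data}(\mathbf{x})\) itself, the fitted parameters approach the ones that generated the data.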
Constructive-Destructive KL-Divergence#
Let \(\boldsymbol{f}_\theta\) be the invertible, bijective normalizing function which maps \(\mathbf{x}\) to \(\mathbf{z}\), i.e. \(\boldsymbol{f}_\theta:\mathbf{x} \in \mathbb{R}^D \rightarrow \mathbf{z} \in \mathbb{R}^D\). Let \(\boldsymbol{g}_\theta\) be the inverse of \(\boldsymbol{f}_\theta\), which is the generating function mapping \(\mathbf{z}\) to \(\mathbf{x}\), i.e. \(\boldsymbol{g}_\theta := \boldsymbol{f}_\theta^{-1} :\mathbf{z} \in \mathbb{R}^D \rightarrow \mathbf{x} \in \mathbb{R}^D\). We can view \(\boldsymbol{f}_\theta\) as a destructive density whereby we “destroy” the density of the original dataset \(p_\text{data}(\mathbf{x})\) into a common base density \(p_\mathbf{z}\). Conversely, we can view \(\boldsymbol{g}_\theta\) as a constructive density whereby we “construct” the density of the original dataset \(p_\text{data}(\mathbf{x})\) from a base density \(p_\mathbf{z}\).
We assume \(\mathbf{z}\sim p_\mathbf{z}(\mathbf{z})\). Using the change of variables formula, we can express the model density \(p_\mathbf{x}(\mathbf{x};\theta)\) in terms of \(\mathbf{z}\) and the transform \(\boldsymbol{f}_\theta\):
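\[
p_\mathbf{x}(\mathbf{x};\theta) = p_\mathbf{z}\left(\boldsymbol{f}_\theta(\mathbf{x})\right) \left| \det \frac{\partial \boldsymbol{f}_\theta(\mathbf{x})}{\partial \mathbf{x}} \right|
\]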
This function \(\boldsymbol{f}_\theta\) “normalizes” the complex data density into a simpler base distribution over \(\mathbf{z}\). We can also express this equation in terms of \(\boldsymbol{g}_\theta\), which is the standard form found in the normalizing flow literature:
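\[
p_\mathbf{x}(\mathbf{x};\theta) = p_\mathbf{z}(\mathbf{z}) \left| \det \frac{\partial \boldsymbol{g}_\theta(\mathbf{z})}{\partial \mathbf{z}} \right|^{-1}, \qquad \mathbf{z} = \boldsymbol{f}_\theta(\mathbf{x})
\]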
The function \(\boldsymbol{g}_\theta\) pushes the base density \(p_\mathbf{z}(\mathbf{z})\) forward to the more complex density over \(\mathbf{x}\).
In this demonstration, we want to show that the following equivalence holds:
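\[
D_\text{KL}\left[p_\text{data}(\mathbf{x}) \,\|\, p_\mathbf{x}(\mathbf{x};\theta)\right]
= D_\text{KL}\left[p_\text{target}(\mathbf{z};\theta) \,\|\, p_\mathbf{z}(\mathbf{z})\right]
\]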
This says that the KL-Divergence between the data distribution \(p_\text{data}(\mathbf{x})\) and the model \(p_\mathbf{x}(\mathbf{x};\theta)\) is equivalent to the KL-Divergence between the *induced* distribution \(p_\text{target}(\mathbf{z};\theta)\) from the transformation \(\boldsymbol{f}_\theta(\mathbf{x})\) and the chosen base distribution \(p_\mathbf{z}(\mathbf{z})\).
Proof
First we deconstruct the KL-Divergence term into its log components.
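\[
D_\text{KL}\left[p_\text{data}(\mathbf{x}) \,\|\, p_\mathbf{x}(\mathbf{x};\theta)\right]
= \mathbb{E}_{p_\text{data}(\mathbf{x})}\left[\log p_\text{data}(\mathbf{x}) - \log p_\mathbf{x}(\mathbf{x};\theta)\right]
\]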
Next, we expand \(p_\mathbf{x}(\mathbf{x};\theta)\) with the change of variables formula:
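\[
= \mathbb{E}_{p_\text{data}(\mathbf{x})}\left[\log p_\text{data}(\mathbf{x}) - \log p_\mathbf{z}\left(\boldsymbol{f}_\theta(\mathbf{x})\right) - \log\left|\det \frac{\partial \boldsymbol{f}_\theta(\mathbf{x})}{\partial \mathbf{x}}\right|\right]
\]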
Now we do a change of variables from the data variable \(\mathbf{x}\) to the base variable \(\mathbf{z}=\boldsymbol{f}_\theta(\mathbf{x})\).
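Using \(p_\text{data}(\mathbf{x}) = p_\text{target}(\mathbf{z};\theta)\left|\det \frac{\partial \boldsymbol{f}_\theta(\mathbf{x})}{\partial \mathbf{x}}\right|\), which is the change of variables formula applied to the induced density, the expectation becomes

\[
= \mathbb{E}_{p_\text{target}(\mathbf{z};\theta)}\left[\log p_\text{target}(\mathbf{z};\theta) + \log\left|\det \frac{\partial \boldsymbol{f}_\theta(\mathbf{x})}{\partial \mathbf{x}}\right| - \log p_\mathbf{z}(\mathbf{z}) - \log\left|\det \frac{\partial \boldsymbol{f}_\theta(\mathbf{x})}{\partial \mathbf{x}}\right|\right], \qquad \mathbf{x} = \boldsymbol{g}_\theta(\mathbf{z})
\]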
Recognize that we have changed the expectation from the data distribution to the induced distribution and that all terms are now with respect to \(\mathbf{z}\): the log-determinant terms cancel. So we can reduce this to:
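\[
= \mathbb{E}_{p_\text{target}(\mathbf{z};\theta)}\left[\log p_\text{target}(\mathbf{z};\theta) - \log p_\mathbf{z}(\mathbf{z})\right]
\]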
where \(p_\text{target}(\mathbf{z};\theta)\) is the distribution of \(\mathbf{z}=\boldsymbol{f}_\theta(\mathbf{x})\) when \(\mathbf{x}\) is sampled from \(p_\text{data}(\mathbf{x})\). So this is simply the KL-Divergence between the transformed data in the latent space and the base distribution we choose:
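\[
D_\text{KL}\left[p_\text{data}(\mathbf{x}) \,\|\, p_\mathbf{x}(\mathbf{x};\theta)\right]
= D_\text{KL}\left[p_\text{target}(\mathbf{z};\theta) \,\|\, p_\mathbf{z}(\mathbf{z})\right]
\]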
which completes the proof.
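In practice, this result is what makes maximum-likelihood training of a flow tractable: we only need \(\log p_\mathbf{z}\), the forward map \(\boldsymbol{f}_\theta\), and its log-determinant. Below is a minimal sketch of the resulting negative log-likelihood objective (plain NumPy; the elementwise affine flow and all names are hypothetical stand-ins chosen for illustration):

```python
import numpy as np

# Minimal sketch: the negative log-likelihood of a normalizing flow via the
# change of variables formula,
#   log p_x(x; theta) = log p_z(f_theta(x)) + log|det d f_theta(x) / d x|,
# illustrated with a hypothetical elementwise affine flow f_theta(x) = (x - mu) / sigma.

def standard_normal_logpdf(z):
    # log N(z; 0, I), summed over the feature dimension
    return -0.5 * np.sum(z**2 + np.log(2.0 * np.pi), axis=-1)

def affine_flow_nll(x, mu, log_sigma):
    """Average negative log-likelihood of samples x (shape [N, D]) under the flow."""
    z = (x - mu) * np.exp(-log_sigma)             # f_theta(x): "destroy"/normalize the data
    log_det = -np.sum(log_sigma)                  # log|det Jacobian| of f_theta (diagonal)
    log_px = standard_normal_logpdf(z) + log_det  # change of variables
    return -np.mean(log_px)                       # estimates the KL up to the constant data entropy

# Usage: minimizing this over (mu, log_sigma) fits the flow by maximum likelihood.
x = np.random.default_rng(0).normal(loc=1.0, scale=3.0, size=(1000, 2))
print(affine_flow_nll(x, mu=np.zeros(2), log_sigma=np.zeros(2)))
```

Minimizing this loss drives the transformed samples \(\mathbf{z}=\boldsymbol{f}_\theta(\mathbf{x})\) toward the chosen base distribution, which is exactly the constructive-destructive equivalence proved above.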