Variational Inference#


Motivations#

Variational inference is arguably the most scalable approximate inference method the machine learning community has (as of 2019).

Ultimately, we are interested in approximating the marginal distribution of our data, \(\mathcal{X}\).

\[ \mathbf{x} \in \mathcal{X}\sim \mathbb{P}_* \]

We approximate the true (or best) underlying distribution via some parameterized form like so

\[ p_*(\mathbf{x}) \approx p_{\boldsymbol \theta}(\mathbf{x}). \]

However, in order to obtain this, we need to assume some latent variable, \(\mathbf{z}\), plays a role in estimating the underlying density. In the simplest form, we assume a generative model for the joint distribution can be written as

\[ p_\theta(z, x) = p_\theta(x|z)p_\theta(z) \]

When fitting a model, we are interested in maximizing the marginal likelihood

\[ p_\theta(x) = \int p_\theta(x|z)p_\theta(z)dz \]

However, this quantity is intractable because a non-linear function sits inside the integral. So we introduce a variational distribution, \(q_\phi(z|x)\) (sometimes called an encoder). Since \(\log p_\theta(x)\) does not depend on \(z\), we can write it as an expectation under the variational distribution:

\[ \log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x) \right] \]
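As a quick aside, here is a minimal numerical sketch of why the marginal likelihood integral above is awkward to compute directly. The toy model (a one-dimensional latent with a tanh decoder) is entirely my own illustrative choice; the point is that the naive Monte Carlo estimator, while unbiased, can have very high variance, which is part of the motivation for variational inference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent variable model (hypothetical, for illustration only):
#   p(z)   = N(0, 1)
#   p(x|z) = N(tanh(z), 0.1^2)   <- non-linear decoder, so
#   p(x) = ∫ p(x|z) p(z) dz has no closed form.
def log_gaussian(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

x_obs = 0.7  # a single observation

# Naive Monte Carlo: p(x) ≈ (1/S) Σ_s p(x|z_s),  z_s ~ p(z).
# Most prior samples z_s explain x poorly, so the estimate is dominated
# by a few lucky samples and its variance is large.
z_samples = rng.normal(0.0, 1.0, size=10_000)
estimate = np.exp(log_gaussian(x_obs, np.tanh(z_samples), 0.1)).mean()
print("naive MC estimate of p(x):", estimate)
```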

Pros and Cons#

These points were taken from the slides of Shakir Mohamed (probabilistic methods, MLSS 2019).

Why Variational Inference?#

  • Applicable to all probabilistic models

  • Transforms a problem of integration into one of optimization

  • Convergence assessment

  • Principled and Scalable approach to model selection

  • Compact representation of posterior distribution

  • Faster to converge (e.g. compared to sampling-based methods such as MCMC)

  • Numerically stable

  • Modern Computing Architectures (GPUs)

  • There is a LOT of research already!

Why Not Variational Inference?#

  • Approximate posterior only

  • Difficulty in optimization due to local minima

  • Under-estimates the variance of the posterior

  • Limited theory and guarantees for variational methods


Variational Distribution#

We defined the variational distribution as \(q(z|x)\). However, there are many types of variational distributions we can impose. For example, we have some of the following:

  • Gaussian, \(q(z)\)

  • Mixture Distribution, \(\sum_{k}^{K}\pi_k \mathbb{P}_k\)

  • Bijective Transform (Flow), \(q(z|\tilde{z})\)

  • Stochastic Transform (Encoder, Amortized), \(q(z|x)\)

  • Conditional, \(q(z|x,y)\)

Below we will go through each of them and outline some potential strengths and weaknesses of each of the methods.


Simple, \(q(z)\)#

This is the simplest case, where we assume a single simple distribution, e.g. a Gaussian, can describe the latent variable.

\[ q(z) = \mathcal{N}(z|\boldsymbol{\mu_\theta},\boldsymbol{\Sigma_\theta}) \]

If we take each of the Gaussian parameters in their full, unconstrained form, we end up with:

\[ \boldsymbol{\mu_\theta}:=\boldsymbol{\mu} \in \mathbb{R}^D, \hspace{5mm} \boldsymbol{\Sigma_\theta}:=\boldsymbol{\Sigma} \in \mathbb{R}^{D\times D}. \]

For very high dimensional problems, these are a lot of parameters to learn. Now, we can have various simplifications (or complications) with this. For example, we can simplify the mean, \(\boldsymbol{\mu}\), to be zero. The majority of the changes will come from the covariance. Here are a few modifications.

Full Covariance

This is when we parameterize our covariance to be a full covariance matrix. \(\boldsymbol{\Sigma_\theta} := \boldsymbol{\Sigma}\). This is easily the most expensive and the most complex of the Gaussian types.

Lower Cholesky

We can also parameterize our covariance via a lower triangular matrix, i.e. \(\boldsymbol{\Sigma_\theta} := \mathbf{L}\mathbf{L}^\top\), where \(\mathbf{L}\) satisfies the Cholesky decomposition, \(\mathbf{LL}^\top = \boldsymbol{\Sigma}\). This roughly halves the number of parameters of the full covariance (\(D(D+1)/2\) instead of \(D^2\)). It also has desirable properties when parameterizing covariance matrices that are computationally attractive, e.g. positive definiteness is easy to guarantee.

Diagonal Covariance

We can parameterize our covariance matrix to be diagonal, i.e. \(\boldsymbol{\Sigma_\theta} := \text{diag}(\boldsymbol{\sigma})\). This is a drastic simplification of our model which limits its expressivity. However, there are immense computational benefits. For example, a \(D\)-dimensional multivariate Gaussian random variable with a mean and a diagonal covariance is the same as a product of \(D\) univariate Gaussians.

\[ q(z) = \mathcal{N}\left(\mathbf{z}|\boldsymbol{\mu_\theta}, \text{diag}(\boldsymbol{\sigma_\theta})\right) = \prod_{d}^D \mathcal{N}(z_d|\mu_d, \sigma_d ) \]

This is also known as the mean-field approximation and it is a very common starting point in practical VI algorithms.
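As a rough sketch (plain NumPy, with an arbitrary dimensionality) of how the parameter counts compare across these covariance parameterizations, and a sanity check that the diagonal (mean-field) Gaussian factorizes into univariate Gaussians:

```python
import numpy as np

D = 50  # latent dimensionality (arbitrary choice for illustration)

# Free parameters needed for the covariance under each parameterization:
n_full     = D * D                # unconstrained full matrix
n_cholesky = D * (D + 1) // 2     # lower-triangular L with Sigma = L L^T
n_diag     = D                    # diag(sigma), the mean-field choice
print(n_full, n_cholesky, n_diag)  # 2500 1275 50

# Mean-field sanity check: the log-density of a diagonal-covariance Gaussian
# equals the sum of D independent univariate Gaussian log-densities.
rng = np.random.default_rng(0)
mu, sigma, z = rng.normal(size=D), rng.uniform(0.5, 2.0, size=D), rng.normal(size=D)

def log_normal_1d(z_d, mu_d, sigma_d):
    return -0.5 * (np.log(2 * np.pi * sigma_d**2) + ((z_d - mu_d) / sigma_d) ** 2)

log_q_joint = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(sigma**2))
                      + np.sum(((z - mu) / sigma) ** 2))
log_q_prod = sum(log_normal_1d(z[d], mu[d], sigma[d]) for d in range(D))
assert np.allclose(log_q_joint, log_q_prod)
```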

Low Rank Multivariate Normal

Another parameterization combines a low-rank matrix with a diagonal matrix, i.e. \(\boldsymbol{\Sigma_\theta} := \mathbf{W}\mathbf{W}^\top + \mathbf{D}\), where \(\mathbf{W} \in \mathbb{R}^{D\times d}\) with \(d \ll D\) and \(\mathbf{D} \in \mathbb{R}^{D\times D}\) is diagonal. We assume that the covariance structure is effectively low dimensional, which might be appropriate for some applications. This allows for computationally efficient schemes that make use of the Woodbury identity and the matrix determinant lemma.
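Here is a minimal sketch (dimensions chosen arbitrarily) of the computational benefit: with \(\boldsymbol{\Sigma_\theta} = \mathbf{WW}^\top + \mathbf{D}\), the matrix determinant lemma reduces the \(D \times D\) log-determinant to a \(d \times d\) one.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 1000, 10          # ambient and low-rank dimensions (illustrative)

W = rng.normal(size=(D, d))
diag = rng.uniform(0.5, 2.0, size=D)     # positive diagonal entries of D

# log|Sigma| with Sigma = W W^T + D, computed two ways.
# Direct: O(D^3).
Sigma = W @ W.T + np.diag(diag)
_, logdet_direct = np.linalg.slogdet(Sigma)

# Matrix determinant lemma: |W W^T + D| = |I_d + W^T D^{-1} W| * |D|  ->  O(D d^2).
M = np.eye(d) + W.T @ (W / diag[:, None])
logdet_lemma = np.linalg.slogdet(M)[1] + np.sum(np.log(diag))

assert np.allclose(logdet_direct, logdet_lemma)
```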

Orthogonal Decoupled

One interesting approach is to map the variational parameters via a subspace parameterization. For example, we can define the mean and covariance like so:

\[\begin{split} \begin{aligned} \boldsymbol{\mu_\theta} &= \boldsymbol{\Psi}_{\boldsymbol{\mu}} \mathbf{a} \\ \boldsymbol{\Sigma_\theta} &= \boldsymbol{\Psi}_{\boldsymbol{\Sigma}} \mathbf{A} \boldsymbol{\Psi}_{\boldsymbol{\Sigma}}^\top + \mathbf{I} \end{aligned} \end{split}\]

This is a bit of a spin-off of the Low-Rank Multivariate Normal approach. However, this method decouples the mean and the covariance and provides a subspace parameterization for each. The authors argue that we can then put more computational effort into the mean (computationally cheap) and less into the covariance (computationally expensive).

Source: Orthogonally Decoupled Variational Gaussian Processes - Salimbeni et al. (2018)

Delta Distribution

This is probably the distribution with the fewest parameters. We set the covariance matrix to \(0\), i.e. \(\boldsymbol{\Sigma_\theta}:=\mathbf{0}\), and we let all of the mass rest on the mean point, \(\boldsymbol{\mu_\theta}:=\hat{\mathbf{z}}\). This recovers a point estimate of the latent variable (e.g. the MAP solution).

\[ q(z) = \delta(z - \hat{z}) \]

Mixture Distribution#

The principle behind this is that a simple base distribution, e.g. a Gaussian, is not expressive enough. However, a mixture of simple distributions, e.g. a Mixture of Gaussians, will be more expressive. So the idea is to choose a simple base distribution, replicate it \(K\) times, and then take a normalized weighted sum of the components to produce our mixture distribution.

\[ q(z) = \sum_{k}^K\pi_k \mathbb{P}_k \]

where \(0 \leq \pi_k \leq 1\) and \(\sum_{k}^K\pi_k=1\). For example, we can use a Gaussian distribution

\[ \mathbb{P}_k = \mathcal{N}(\mathbf{z}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]

where \(\theta = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k \}_k^K\) are potentially learned parameters. The mixture distribution will then be

\[ q_{\boldsymbol \theta}(\mathbf{z}) = \sum_{k}^K \pi_k \mathcal{N}(\mathbf{z} |\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]

Again, we are free to parameterize the covariances to be as flexible or as restrictive as we like. For example, we can have full, Cholesky, low-rank, or diagonal covariances. In addition, we can tie some of the parameters together; for example, we can share the same covariance matrix across all \(K\) components, e.g. \(\boldsymbol{\Sigma}_k=\boldsymbol{\Sigma}\). Even for VAEs, using a mixture as the prior distribution gives a noticeable improvement over the standard Gaussian prior.

Note: in principle, a mixture distribution is very powerful and can approximate any distribution given enough components (e.g. any univariate density). However, as with most problems, the issue is estimating good parameters from observations alone.
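A small sketch (NumPy/SciPy, with made-up component parameters) of evaluating the log-density of such a mixture; the log-sum-exp form below is how it is usually implemented for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp

def log_normal_diag(z, mu, sigma):
    """Log-density of a diagonal-covariance Gaussian, summed over dimensions."""
    return np.sum(-0.5 * (np.log(2 * np.pi * sigma**2) + ((z - mu) / sigma) ** 2), axis=-1)

def mixture_log_prob(z, pis, mus, sigmas):
    """log q(z) = logsumexp_k [ log pi_k + log N(z | mu_k, diag(sigma_k^2)) ]."""
    comp = np.stack([np.log(pis[k]) + log_normal_diag(z, mus[k], sigmas[k])
                     for k in range(len(pis))])
    return logsumexp(comp, axis=0)

# A toy 2-component mixture in 2D (all numbers are arbitrary, for illustration).
pis    = np.array([0.3, 0.7])
mus    = np.array([[-2.0, 0.0], [2.0, 1.0]])
sigmas = np.array([[1.0, 1.0], [0.5, 0.5]])

print(mixture_log_prob(np.zeros(2), pis, mus, sigmas))
```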


Bijective Transformation (Flow)#

It may be that the variational distribution, \(q\), is not expressive enough even with a complex Gaussian parameterization and/or a mixture distribution. So another option is to use a bijective transformation to map samples from a simple base distribution, e.g. a Gaussian, to a more complex distribution for our variational parameter, \(z\).

\[ \mathbf{z} = \boldsymbol{T_\phi}(\tilde{\mathbf{z}}) \]

We hope that the resulting variational distribution, \(q(z)\), acts as a better approximation. Because our transformation is bijective, we can evaluate the density of the variational parameter, \(z\), exactly by relating it back to the simple base distribution via the change-of-variables formula:

\[ q(\mathbf{z}) = p(\tilde{\mathbf{z}})\left|\boldsymbol{\nabla}_\mathbf{z}\boldsymbol{T_\phi}^{-1}(\mathbf{z})\right|, \hspace{4mm} \tilde{\mathbf{z}} = \boldsymbol{T_\phi}^{-1}(\mathbf{z}) \]

where \(|\boldsymbol{\nabla}_\mathbf{z} \cdot|\) denotes the absolute value of the determinant of the Jacobian, here of the inverse transformation \(\boldsymbol{T_\phi}^{-1}\).
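A minimal sketch of the change-of-variables computation above, using a single elementwise affine bijection as \(\boldsymbol{T_\phi}\) (the choice of bijection and its parameters are purely illustrative; real flows stack many richer bijections):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 3

def log_std_normal(x):
    return np.sum(-0.5 * (np.log(2 * np.pi) + x**2))

# A simple elementwise affine bijection T_phi(z_tilde) = exp(s) * z_tilde + b.
s, b = rng.normal(size=dim), rng.normal(size=dim)

def T(z_tilde):            # forward transform
    return np.exp(s) * z_tilde + b

def T_inv(z):              # inverse transform
    return (z - b) * np.exp(-s)

def log_q(z):
    # change of variables: log q(z) = log p(T^{-1}(z)) + log|det grad_z T^{-1}(z)|
    return log_std_normal(T_inv(z)) - np.sum(s)

# Sanity check: pushing N(0, I) through T gives N(b, diag(exp(2s))).
z = T(rng.normal(size=dim))
log_q_closed = np.sum(-0.5 * (np.log(2 * np.pi * np.exp(2 * s)) + ((z - b) / np.exp(s)) ** 2))
assert np.allclose(log_q(z), log_q_closed)
```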


Stochastic Transformation (Encoder, Amortization)#

Another type of transformation is a stochastic transformation, given by \(q(z|x)\). In this case, we assume some non-linear function of the data parameterizes the distribution. For example, a Gaussian distribution whose mean and variance are given by neural networks:

\[ q(\mathbf{z}|\mathbf{x}) = \mathcal{N}\left(\boldsymbol{\mu_\phi}(\mathbf{x}), \boldsymbol{\sigma_\phi}(\mathbf{x})\right) \]

or more appropriately

\[ q(\mathbf{z}|\mathbf{x}) = \mathcal{N}\left(\boldsymbol{\mu}, \text{diag}(\exp (\boldsymbol{\sigma}^2_{\log}) )\right), \hspace{4mm} (\boldsymbol{\mu}, \boldsymbol{\sigma}^2_{\log}) = \text{NN}_{\boldsymbol \theta}(\mathbf{x}) \]

It can be very difficult to construct a variational distribution that is complicated enough to cover the whole posterior. So often we use a variational distribution that is conditioned on the observations, i.e. \(q(z|x)\). This is known as an encoder because we encode the observations into the parameters of the distribution over the latent variable; since the parameters \(\phi\) are shared across all observations, this is also called amortized inference.
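A minimal sketch (NumPy, with a tiny arbitrary architecture and untrained random weights) of an amortized encoder that maps an observation to the mean and log-variance of \(q(z|x)\), plus a reparameterized sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, hidden, z_dim = 4, 16, 2       # illustrative sizes

# A tiny one-hidden-layer "encoder" NN_theta(x) -> (mu, log sigma^2).
# Weights are random here; in practice they are learned by maximizing the ELBO.
W1, b1 = rng.normal(size=(hidden, x_dim)) * 0.1, np.zeros(hidden)
W2, b2 = rng.normal(size=(2 * z_dim, hidden)) * 0.1, np.zeros(2 * z_dim)

def encode(x):
    h = np.tanh(W1 @ x + b1)
    out = W2 @ h + b2
    mu, log_var = out[:z_dim], out[z_dim:]
    return mu, log_var

def sample_q(x):
    """Reparameterized sample z = mu(x) + sigma(x) * eps, eps ~ N(0, I)."""
    mu, log_var = encode(x)
    eps = rng.normal(size=z_dim)
    return mu + np.exp(0.5 * log_var) * eps

x = rng.normal(size=x_dim)
print(sample_q(x))
```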


ELBO (Encoder) - Derivation#

This derivation comes from the book Probabilistic Machine Learning by Kevin Murphy. I find it to be a much better and more intuitive derivation.

Note: I put the encoder tag in the title because there are other ELBOs that serve different purposes, for example, variational distributions without an encoder and also an encoder for conditional likelihoods. In this first one, we will look at the ELBO derivation with an encoder (amortized) variational distribution.

As mentioned above, we are interested in expanding the expectation of the marginal likelihood wrt the encoder variational distribution

\[ \log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x) \right] \]

We will do a bit of mathematical manipulation to expand this expectation. Firstly, we will start with Bayes rule:

\[ p_\theta(x) = \frac{p_\theta(z,x)}{p_\theta(z|x)} \]

Plugging this into our expectation gives us:

\[ \log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p_\theta(z,x)}{p_\theta(z|x)} \right] \]

Now we will do the identity trick (multiply by \(\frac{q_\phi(z|x)}{q_\phi(z|x)}=1\) :) ) within the log term to incorporate the variational distribution, \(q_\phi\).

\[ \log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p_\theta(z,x)q_\phi(z|x)}{p_\theta(z|x)q_\phi(z|x)} \right] = \mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p_\theta(z,x)q_\phi(z|x)}{q_\phi(z|x)p_\theta(z|x)} \right] \]

Using the log rules, we can split this fraction into two fractions;

\[ \log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p_\theta(z,x)}{q_\phi(z|x)} + \log \frac{q_\phi(z|x)}{p_\theta(z|x)} \right] \]

Now, we can expand the expectation term across the additive operator

\[ \log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p_\theta(z,x)}{q_\phi(z|x)} \right] + \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)} \right] \]

Here, we notice that the second term is actually the Kullback-Leibler divergence term.

\[ \text{D}_{\text{KL}} [Q||P] = \mathbb{E}_Q\left[\log \frac{Q}{P} \right] = - \mathbb{E}_Q\left[\log \frac{P}{Q} \right] \]

so we can replace this with the more compact form.

\[ \log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p_\theta(z,x)}{q_\phi(z|x)} \right] + \text{D}_{\text{KL}} \left[q_\phi(z|x)||p_\theta(z|x) \right] \]

We know from theory that the KL divergence is always non-negative. This means that the first term is a lower bound on the marginal log-likelihood.

\[ \mathcal{L}_{\text{ELBO}}:=\mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p_\theta(z,x)}{q_\phi(z|x)} \right] \leq \log p_\theta(x) \]

This term is called the Evidence Lower Bound (ELBO). The objective is to maximize this term, which also minimizes the KL divergence between the variational distribution and the true posterior.

\[ \mathcal{L}_{\text{ELBO}}=\mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p_\theta(z,x)}{q_\phi(z|x)} \right] \]

So now, we can expand the joint distribution using the product rule, i.e. \(p(z,x)=p(x|z)p(z)\), to give us

\[ \mathcal{L}_{\text{ELBO}}=\mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p(x|z)p(z)}{q_\phi(z|x)} \right] \]

We can also expand this fraction using the log rules,

\[ \mathcal{L}_{\text{ELBO}}=\mathbb{E}_{q_\phi(z|x)}\left[ \log p(x|z) + \log p(z) - \log q_\phi(z|x) \right]. \]

where:

  • \(q_\phi(z|x)\) - encoder network

  • \(p_\theta(x|z)\) - decoder network

  • \(p_\theta(z)\) - prior network

Now, we have some options on how we can group the likelihood, the prior and the variational distribution together and each of them will offer a slightly different interpretation and application.


Reconstruction Loss#

If we group the prior probability and the variational distribution together, we get:

\[ \mathcal{L}_{\text{ELBO}}=\mathbb{E}_{q_\phi(z|x)}\left[ \log p(x|z) \right] + \mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p(z)}{q_\phi(z|x)} \right]. \]

The second term looks like the KLD term from before, but with the ratio inverted. So with a little sleight of hand with the signs, we can rearrange the term to be

\[ \mathcal{L}_{\text{ELBO}}= \mathbb{E}_{q_\phi(z|x)}\left[ \log p(x|z) \right] - \mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{q_\phi(z|x)} {p(z)}\right]. \]

Proof:

\[ \mathbb{E}_q[ \log p - \log q] = - \mathbb{E}_q[\log q - \log p] = - \mathbb{E}_q[\log\frac{q}{p}] \]

QED.


So now, we have the exact same KLD term as before. So let’s use the simplified form.

\[ \mathcal{L}_{\text{ELBO}}={\color{blue}\mathbb{E}_{q_\phi(z|x)}\left[ \log p(x|z) \right]} - {\color{green}\text{D}_\text{KL}\left[q_\phi(z|x)||p(z)\right]}. \]

where:

  • \({\color{blue}\mathbb{E}_{q_\phi(z|x)}\left[ \log p(x|z) \right]}\) - is the \(\color{blue}\text{reconstruction loss}\).

  • \({\color{green}\text{D}_\text{KL}\left[q_\phi(z|x)||p(z)\right]}\) - is the complexity term, i.e. the \(\color{green}\text{KL divergence}\) (a measure of discrepancy, not a true distance) between the variational distribution and the prior.

This is easily the most common form of the ELBO, especially with Variational AutoEncoders (VAEs). The first term is the expectation of the likelihood term with respect to the variational distribution. The second term is the KLD between the variational distribution and the prior.
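A minimal sketch (NumPy, with untrained placeholder encoder outputs and a stand-in decoder of my own choosing) of how this form of the ELBO is typically estimated in a VAE: a one-sample Monte Carlo estimate of the reconstruction term via the reparameterization trick, plus the closed-form KL between a diagonal Gaussian and a standard normal prior.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_diag_gauss_std_normal(mu, log_var):
    """Closed-form KL[ N(mu, diag(exp(log_var))) || N(0, I) ]."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def log_gauss(x, mean, std):
    return np.sum(-0.5 * (np.log(2 * np.pi * std**2) + ((x - mean) / std) ** 2))

# Placeholder pieces (not a trained model; shapes and values are illustrative):
x = rng.normal(size=5)                                        # one observation
mu_z, log_var_z = rng.normal(size=2), rng.normal(size=2)      # encoder output for q(z|x)
W_dec = rng.normal(size=(5, 2))                               # stand-in decoder weights

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
eps = rng.normal(size=2)
z = mu_z + np.exp(0.5 * log_var_z) * eps

# ELBO = E_q[log p(x|z)] - KL[q(z|x) || p(z)], with a 1-sample MC reconstruction term.
recon = log_gauss(x, np.tanh(W_dec @ z), std=1.0)
elbo = recon - kl_diag_gauss_std_normal(mu_z, log_var_z)
print("ELBO (1-sample estimate):", elbo)
```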


Volume Correction#

Another approach is more along the lines of a transformed distribution. Assume we have our original data domain, \(\mathcal{X}\), and some stochastic transformation, \(p(z|x)\), which maps the data from our original domain to a transformed domain, \(\mathcal{Z}\).

\[ z \sim p(z|x) \]

To derive this, let's look at the expanded ELBO again

\[ \mathcal{L}_{\text{ELBO}}=\mathbb{E}_{q_\phi(z|x)}\left[ \log p(x|z) + \log p(z) - \log q_\phi(z|x) \right]. \]

except this time we will isolate the prior and combine the likelihood and the variational distribution.

\[ \mathcal{L}_{\text{ELBO}}={\color{blue}\mathbb{E}_{q_\phi(z|x)}\left[ \log p(z) \right]} + {\color{green}\mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p(x|z)}{q_\phi(z|x)} \right]}. \]

where:

  • \({\color{blue}\mathbb{E}_{q_\phi(z|x)}\left[ \log p(z) \right]}\) - is the expectation of the transformed distribution, aka the \({\color{blue}\text{reparameterized probability}}\).

  • \({\color{green}\mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p(x|z)}{q_\phi(z|x)} \right]}\) - is the ratio between the inverse transform and the forward transform, i.e. the \({\color{green}\text{Volume Correction Factor}}\) or likelihood contribution.

Source: I first saw this approach in the SurVAE Flows paper.
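As a quick numerical check (with toy densities of my own choosing), this regrouping is just a rearrangement of the same Monte Carlo ELBO estimate; on the same samples it agrees exactly with the reconstruction-plus-KL grouping from the previous section.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mean, std):
    return np.sum(-0.5 * (np.log(2 * np.pi * std**2) + ((x - mean) / std) ** 2), axis=-1)

# Toy densities (all choices are illustrative, not from the text):
x = np.array([0.5, -1.0])
mu_q, std_q = np.array([0.2, 0.1]), np.array([0.8, 0.6])      # q(z|x)

z = mu_q + std_q * rng.normal(size=(2000, 2))                  # samples from q(z|x)
log_qz = log_gauss(z, mu_q, std_q)                             # log q(z|x)
log_pz = log_gauss(z, 0.0, 1.0)                                # log p(z), standard normal
log_px_z = log_gauss(x, np.tanh(z), 0.5)                       # log p(x|z), toy decoder

# The same ELBO, grouped as "reparameterized probability" + "volume correction":
elbo_volume = np.mean(log_pz) + np.mean(log_px_z - log_qz)
# ...which matches the reconstruction + KL grouping on the same samples:
elbo_recon  = np.mean(log_px_z) - np.mean(log_qz - log_pz)
assert np.allclose(elbo_volume, elbo_recon)
```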


Variational Free Energy (VFE)#

There is one more main decomposition that remains (it's often seen in the literature). Looking at the equation again

\[ \mathcal{L}_{\text{ELBO}}=\mathbb{E}_{q_\phi(z|x)}\left[ \log p(x|z) + \log p(z) - \log q_\phi(z|x) \right], \]

we now isolate the likelihood and the prior under the variational expectation. This gives us:

\[ \mathcal{L}_{\text{ELBO}}={\color{blue}\mathbb{E}_{q_\phi(z|x)}\left[ \log p(x|z) p(z)\right]} - {\color{green} \mathbb{E}_{q_\phi(z|x)}\left[ \log q_\phi(z|x) \right]}. \]

where:

  • \({\color{blue}\mathbb{E}_{q_\phi(z|x)}\left[ \log p(x|z) p(z)\right]}\) - is the \({\color{blue}\text{energy}}\) function

  • \(-{\color{green} \mathbb{E}_{q_\phi(z|x)}\left[ \log q_\phi(z|x) \right]}\) - is the \({\color{green}\text{entropy}}\), \(\mathbb{H}[q_\phi]\)

Source: I see this form a lot in the Gaussian process literature, for example when deriving the sparse Gaussian process of Titsias.
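Continuing the same kind of toy example (again, the densities are my own illustrative choices), here is a sketch of the energy-plus-entropy grouping; for a Gaussian \(q\) the entropy term is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mean, std):
    return np.sum(-0.5 * (np.log(2 * np.pi * std**2) + ((x - mean) / std) ** 2), axis=-1)

# Same toy setup as before (illustrative, not from the text):
x = np.array([0.5, -1.0])
mu_q, std_q = np.array([0.2, 0.1]), np.array([0.8, 0.6])       # q(z|x)
z = mu_q + std_q * rng.normal(size=(2000, 2))

# "Energy" term: E_q[ log p(x|z) + log p(z) ], estimated by Monte Carlo.
energy = np.mean(log_gauss(x, np.tanh(z), 0.5) + log_gauss(z, 0.0, 1.0))

# Entropy of a diagonal Gaussian is available in closed form:
#   H[q] = 0.5 * sum( log(2 * pi * e * sigma^2) )
entropy = 0.5 * np.sum(np.log(2 * np.pi * np.e * std_q**2))

print("ELBO = energy + entropy:", energy + entropy)
```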


ELBO (Non-Encoder) - Derivation#

In all of these formulas, we have an encoder as our variational distribution, i.e. \(q(z|x)\), which seeks to amortize the inference. Sometimes this is not necessary and we can find an expressive enough variational distribution without conditioning on the data, i.e. \(q(z)\). This often happens in very simple models, e.g. \(y = \mathbf{Wx} + \mathbf{b} + \epsilon\).

This will be a similar derivation to the one above, except the expectation is taken with respect to \(q_\phi(z)\):

\[ \log p_\theta(x) = \mathbb{E}_{q_\phi(z)}\left[ \log p_\theta(x) \right] \]

I am going to skip ahead of the derivation and simply state the results. The steps mirror the encoder case, with \(q_\phi(z|x)\) replaced by \(q_\phi(z)\), so it can be useful to start from these forms and work backwards.


Reconstruction Loss

This is the easiest form to show because it appears in many simpler applications, where the model is simple enough that a non-amortized variational distribution \(q_\phi(z)\) suffices:

\[ \mathcal{L}_{\text{ELBO}}={\color{blue}\mathbb{E}_{q_\phi(z)}\left[ \log p(x|z) \right]} - {\color{green}\text{D}_\text{KL}\left[q_\phi(z)||p(z)\right]}. \]
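A small sketch (a toy conjugate model of my own choosing) of this non-amortized setting: the parameters of \(q_\phi(z)\) are free scalars rather than the output of an encoder, and the ELBO is largest when they match the true posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mean, std):
    return np.sum(-0.5 * (np.log(2 * np.pi * std**2) + ((x - mean) / std) ** 2))

# Toy model (illustrative): a single latent mean z with
#   p(z) = N(0, 1)   and   p(x_i | z) = N(z, 1)  for i = 1..N.
x = rng.normal(loc=2.0, scale=1.0, size=20)

def elbo(m, log_s, n_samples=2000):
    """Non-amortized ELBO with q(z) = N(m, exp(log_s)^2): free parameters, no encoder."""
    s = np.exp(log_s)
    z = m + s * rng.normal(size=n_samples)                       # samples from q(z)
    recon = np.mean([log_gauss(x, zi, 1.0) for zi in z])         # E_q[log p(x|z)]
    kl = -log_s + 0.5 * (s**2 + m**2) - 0.5                      # KL[q(z) || N(0, 1)]
    return recon - kl

# The exact posterior here is N( sum(x) / (N + 1), 1 / (N + 1) ).
N = len(x)
print(elbo(m=np.sum(x) / (N + 1), log_s=-0.5 * np.log(N + 1)))   # near-optimal setting
print(elbo(m=0.0, log_s=0.0))                                    # worse setting, lower ELBO
```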

Volume Correction

This decomposition doesn't apply here because the volume correction interpretation requires a (stochastic) transformation from \(x\) to \(z\), i.e. a variational distribution conditioned on \(x\).


Variational Free Energy

\[ \mathcal{L}_{\text{ELBO}}={\color{blue}\mathbb{E}_{q_\phi(z)}\left[ \log p(x|z) p(z)\right]} - {\color{green} \mathbb{E}_{q_\phi(z)}\left[ \log q_\phi(z) \right]}. \]