Context¶
Let’s say we are given a sequence of measurements, $y_n$:

$$\mathcal{D} = \{y_n\}_{n=1}^N$$

We assume that there is some latent state, $z_t$, which enables the sequential measurements to be conditionally independent.
Joint Distribution¶
This represents how we decompose the time series.
We use the properties mentioned above.
$$p_\theta(z_{0:T}, y_{1:T}) = p_\theta(z_0) \prod_{t=1}^T p_\theta(y_t \mid z_t)\, p_\theta(z_t \mid z_{t-1})$$
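As a concrete sketch, assume a scalar linear-Gaussian model, $z_t \sim \mathcal{N}(a z_{t-1}, q)$ and $y_t \sim \mathcal{N}(z_t, r)$, with made-up parameters (this model and its parameters are illustrative assumptions, not part of the derivation). The log of the joint factorization above can then be evaluated term by term:

```python
import math

def log_normal(x, mean, var):
    # Log-density of a univariate Gaussian N(x | mean, var).
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_joint(z, y, a=0.9, q_var=1.0, r_var=0.5):
    # log p(z_{0:T}, y_{1:T}) = log p(z_0) + sum_t [log p(z_t|z_{t-1}) + log p(y_t|z_t)]
    # for the hypothetical scalar model z_t ~ N(a*z_{t-1}, q_var), y_t ~ N(z_t, r_var).
    lp = log_normal(z[0], 0.0, 1.0)                    # prior p(z_0) = N(0, 1)
    for t in range(1, len(z)):
        lp += log_normal(z[t], a * z[t - 1], q_var)    # transition p(z_t | z_{t-1})
        lp += log_normal(y[t - 1], z[t], r_var)        # emission  p(y_t | z_t)
    return lp
```

The function simply accumulates the prior, transition, and emission log-densities from the product above.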
Posterior¶
We are interested in finding the latent states, z0:T, given our observations, y1:T.
However, due to the Markovian nature of the state space model, this process can be carried out recursively, as a combination of a prediction step and a correction step.
This is known as filtering.
$$p_\theta(z_t \mid y_{1:t}) = \frac{1}{E_\theta}\, p_\theta(y_t \mid z_t)\, p_\theta(z_t \mid y_{1:t-1})$$

where the marginal likelihood, $E_\theta$, is given by
$$E_\theta = p_\theta(y_t \mid y_{1:t-1}) = \int p_\theta(y_t \mid z_t)\, p_\theta(z_t \mid y_{1:t-1})\, dz_t$$

This is typically computed by the filtering algorithm, which has a prediction and a correction step.
$$\begin{aligned}
\text{Prediction:} &\quad p_\theta(z_t \mid y_{1:t-1}) = \int p_\theta(z_t \mid z_{t-1})\, p_\theta(z_{t-1} \mid y_{1:t-1})\, dz_{t-1} \\
\text{Correction:} &\quad p_\theta(z_t \mid y_{1:t}) = \frac{1}{E_\theta}\, p_\theta(y_t \mid z_t)\, p_\theta(z_t \mid y_{1:t-1})
\end{aligned}$$
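For the linear-Gaussian special case, both steps are available in closed form (the Kalman filter). A minimal scalar sketch, again assuming the hypothetical model $z_t \sim \mathcal{N}(a z_{t-1}, q)$, $y_t \sim \mathcal{N}(z_t, r)$:

```python
import math

def kalman_step(mu, var, y, a=0.9, q_var=1.0, r_var=0.5):
    # One filtering step for the hypothetical scalar linear-Gaussian SSM.
    # Prediction: p(z_t | y_{1:t-1}) = N(a*mu, a^2*var + q_var)
    mu_pred = a * mu
    var_pred = a * a * var + q_var
    # Correction: condition the Gaussian prediction on the new observation y_t.
    k = var_pred / (var_pred + r_var)          # Kalman gain
    mu_filt = mu_pred + k * (y - mu_pred)
    var_filt = (1.0 - k) * var_pred
    # Marginal likelihood contribution: log E_theta = log p(y_t | y_{1:t-1}).
    log_evidence = -0.5 * (math.log(2 * math.pi * (var_pred + r_var))
                           + (y - mu_pred) ** 2 / (var_pred + r_var))
    return mu_filt, var_filt, log_evidence
```

Running this over a sequence of observations and summing `log_evidence` yields the exact log marginal likelihood for this special case.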
Variational Inference¶
We will start with the full posterior written like so

$$p_\theta(z_{0:t} \mid y_{1:t}) = \frac{p_\theta(z_{0:t}, y_{1:t})}{p_\theta(y_{1:t})}$$

but will rearrange this to have the marginal likelihood isolated

$$p_\theta(y_{1:t}) = \frac{p_\theta(z_{0:t}, y_{1:t})}{p_\theta(z_{0:t} \mid y_{1:t})}$$

Now, we will do the standard log transformation on both sides

$$\log p_\theta(y_{1:t}) = \log \frac{p_\theta(z_{0:t}, y_{1:t})}{p_\theta(z_{0:t} \mid y_{1:t})}$$

Then we will do the identity trick to push in our variational distribution, $q_\phi(z_{0:t})$. Because the left-hand side does not depend on $z_{0:t}$, we can take the expectation with respect to $q_\phi$ without changing it:

$$\log p_\theta(y_{1:t}) = \mathbb{E}_{q_\phi(z_{0:t})}\left[\log \frac{p_\theta(z_{0:t}, y_{1:t})\, q_\phi(z_{0:t})}{p_\theta(z_{0:t} \mid y_{1:t})\, q_\phi(z_{0:t})}\right]$$

Now, we can break apart the log terms

$$\log p_\theta(y_{1:t}) = \mathbb{E}_{q_\phi(z_{0:t})}\left[\log \frac{p_\theta(z_{0:t}, y_{1:t})}{q_\phi(z_{0:t})} + \log \frac{q_\phi(z_{0:t})}{p_\theta(z_{0:t} \mid y_{1:t})}\right]$$

and we can separate the expectation terms as they are additive

$$\log p_\theta(y_{1:t}) = \mathbb{E}_{q_\phi(z_{0:t})}\left[\log \frac{p_\theta(z_{0:t}, y_{1:t})}{q_\phi(z_{0:t})}\right] + \mathbb{E}_{q_\phi(z_{0:t})}\left[\log \frac{q_\phi(z_{0:t})}{p_\theta(z_{0:t} \mid y_{1:t})}\right]$$

The 2nd term on the RHS is the KLD term, which we can replace with the more compact form

$$\log p_\theta(y_{1:t}) = \mathbb{E}_{q_\phi(z_{0:t})}\left[\log \frac{p_\theta(z_{0:t}, y_{1:t})}{q_\phi(z_{0:t})}\right] + D_{\mathrm{KL}}\left[q_\phi(z_{0:t})\,\|\,p_\theta(z_{0:t} \mid y_{1:t})\right]$$

The term on the right is the variational gap.
We know that this will always be greater than or equal to 0.
So we need to maximize the first term in order to minimize the second term, i.e., minimize the variational gap.
Thus, we can drop that term and put a lower bound on the likelihood.
This gives us a lower bound on the log marginal likelihood, known as the evidence lower bound (ELBO):

$$\mathcal{L}_{\mathrm{ELBO}} := \mathbb{E}_{q_\phi(z_{0:T})}\left[\log \frac{p_\theta(z_{0:T}, y_{1:T})}{q_\phi(z_{0:T})}\right] \leq \log p_\theta(y_{1:T})$$

To clean this term up, first we will split the term using the log rules
$$\mathcal{L}_{\mathrm{ELBO}} := \mathbb{E}_{q_\phi(z_{0:T})}\left[\log p_\theta(z_{0:T}, y_{1:T}) - \log q_\phi(z_{0:T})\right]$$

Now, we will decompose the joint distribution based on our priors.
$$\mathcal{L}_{\mathrm{ELBO}} := \mathbb{E}_{q_\phi(z_{0:T})}\left[\sum_{t=1}^T \log p_\theta(y_t \mid z_t) + \sum_{t=1}^T \log p_\theta(z_t \mid z_{t-1}) - \sum_{t=1}^T \log q_\phi(z_t)\right]$$

We can push the summations outside of the logs and expectations
$$\mathcal{L}_{\mathrm{ELBO}} := \sum_{t=1}^T \mathbb{E}_{q_\phi(z_{t-1:t})}\left[\log p_\theta(y_t \mid z_t) + \log p_\theta(z_t \mid z_{t-1}) - \log q_\phi(z_t)\right]$$

Similar to our other derivations of the variational distribution, we will also have 3 different terms depending upon how we break this apart.
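To make the per-time-step decomposition concrete, here is a naive Monte Carlo estimator of these terms, assuming the scalar linear-Gaussian model from before and an independent Gaussian $q_\phi(z_t)$ per step (model, variational family, and all parameters are illustrative assumptions):

```python
import math
import random

def log_normal(x, mean, var):
    # Log-density of a univariate Gaussian N(x | mean, var).
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def elbo_estimate(y, q_means, q_vars, a=0.9, q_var=1.0, r_var=0.5,
                  n_samples=200, seed=0):
    # Monte Carlo estimate of
    #   sum_t E_q[ log p(y_t|z_t) + log p(z_t|z_{t-1}) - log q(z_t) ]
    # (plus the z_0 prior/entropy term), with independent Gaussians
    # q(z_t) = N(q_means[t], q_vars[t]) for t = 0..T.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = [rng.gauss(m, math.sqrt(v)) for m, v in zip(q_means, q_vars)]
        lp = log_normal(z[0], 0.0, 1.0) - log_normal(z[0], q_means[0], q_vars[0])
        for t in range(1, len(z)):
            lp += log_normal(y[t - 1], z[t], r_var)        # reconstruction
            lp += log_normal(z[t], a * z[t - 1], q_var)    # transition prior
            lp -= log_normal(z[t], q_means[t], q_vars[t])  # entropy term
        total += lp
    return total / n_samples
```

Each Monte Carlo sample draws a full trajectory $z_{0:T}$ from $q_\phi$ and evaluates the three log terms along it.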
Variational Free Energy (VFE)¶
There is one more main derivation that remains (that’s often seen in the literature). Looking at the equation (16) again we will isolate the likelihood and the prior under the variational expectation. This gives us:
$$\mathcal{L}_{\mathrm{ELBO}} = \sum_{t=1}^T \mathbb{E}_{q_\phi(z_{t-1:t})}\left[\log p_\theta(y_t \mid z_t)\, p_\theta(z_t \mid z_{t-1})\right] - \sum_{t=1}^T \mathbb{E}_{q_\phi(z_t)}\left[\log q_\phi(z_t)\right]$$

where:

- $\mathbb{E}_{q_\phi(z_{t-1:t})}\left[\log p_\theta(y_t \mid z_t)\, p_\theta(z_t \mid z_{t-1})\right]$ - is the energy function
- $-\mathbb{E}_{q_\phi(z_t)}\left[\log q_\phi(z_t)\right]$ - is the entropy, $\mathbb{H}[q_\phi(z_t)]$
Source: I see this approach a lot in the Gaussian process literature, e.g., when deriving the Sparse Gaussian Process of Titsias.
Reconstruction Loss¶
This is the most common loss. Looking at equation (16) again, if we group the prior probability and the variational distribution together, we get:
$$\mathcal{L}_{\mathrm{ELBO}} := \sum_{t=1}^T \mathbb{E}_{q_\phi(z_t)}\left[\log p_\theta(y_t \mid z_t)\right] + \sum_{t=1}^T \mathbb{E}_{q_\phi(z_{t-1:t})}\left[\log \frac{p_\theta(z_t \mid z_{t-1})}{q_\phi(z_t)}\right]$$

This is the same KLD term as before but in the reverse order. So with a sleight of hand in terms of the signs, we can rearrange the term to be
$$\mathcal{L}_{\mathrm{ELBO}} := \sum_{t=1}^T \mathbb{E}_{q_\phi(z_t)}\left[\log p_\theta(y_t \mid z_t)\right] - \sum_{t=1}^T \mathbb{E}_{q_\phi(z_{t-1:t})}\left[\log \frac{q_\phi(z_t)}{p_\theta(z_t \mid z_{t-1})}\right]$$

So now, we have the exact same KLD term as before. So let’s use the simplified form.
$$\mathcal{L}_{\mathrm{ELBO}} = \sum_{t=1}^T \mathbb{E}_{q_\phi(z_t)}\left[\log p_\theta(y_t \mid z_t)\right] - \sum_{t=1}^T \mathbb{E}_{q_\phi(z_{t-1})}\left[D_{\mathrm{KL}}\left[q_\phi(z_t)\,\|\,p_\theta(z_t \mid z_{t-1})\right]\right]$$

where:

- $\mathbb{E}_{q_\phi(z_t)}\left[\log p_\theta(y_t \mid z_t)\right]$ - is the reconstruction loss.
- $D_{\mathrm{KL}}\left[q_\phi(z_t)\,\|\,p_\theta(z_t \mid z_{t-1})\right]$ - is the complexity, i.e. the KL divergence (a measure of discrepancy, though not a true distance metric) between the variational distribution and the prior.
This is easily the most common form of the ELBO, especially with Variational AutoEncoders (VAEs). The first term is the expectation of the likelihood term wrt the variational distribution. The second term is the KLD between the variational distribution and the prior.
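When both the variational distribution and the transition prior are Gaussian, the complexity term has a closed form, so no sampling is needed for it. A scalar sketch (the function name and scalar parameterization are my own):

```python
import math

def kl_gauss(m_q, v_q, m_p, v_p):
    # Closed-form KL[ N(m_q, v_q) || N(m_p, v_p) ] for scalar Gaussians:
    #   0.5 * ( log(v_p/v_q) + (v_q + (m_q - m_p)^2) / v_p - 1 )
    return 0.5 * (math.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)
```

For example, `kl_gauss(0.0, 1.0, 0.0, 1.0)` returns `0.0`, since the KL divergence between identical distributions is zero.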
Volume Correction¶
Another approach is more along the lines of a transformed distribution. Assume we have our original data domain, $\mathcal{X}$, and some stochastic transformation, $p(z \mid x)$, which maps the data from our original domain to a transformed domain, $\mathcal{Z}$.
$$z \sim p(z \mid x)$$

To acquire this from equation (16), we will isolate the prior and combine the likelihood and the variational distribution.
$$\mathcal{L}_{\mathrm{ELBO}} = \sum_{t=1}^T \mathbb{E}_{q_\phi(z_{t-1:t})}\left[\log p_\theta(z_t \mid z_{t-1})\right] + \sum_{t=1}^T \mathbb{E}_{q_\phi(z_t)}\left[\log \frac{p_\theta(y_t \mid z_t)}{q_\phi(z_t)}\right]$$

where:

- $\mathbb{E}_{q_\phi(z_{t-1:t})}\left[\log p_\theta(z_t \mid z_{t-1})\right]$ - is the expectation of the transformed distribution, aka the reparameterized probability.
- $\mathbb{E}_{q_\phi(z_t)}\left[\log \frac{p_\theta(y_t \mid z_t)}{q_\phi(z_t)}\right]$ - is the ratio between the inverse transform and the forward transform, i.e. the Volume Correction Factor or likelihood contribution.
Source: I first saw this approach in the SurVAE Flows paper.
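In the special case where the transform is a deterministic bijection, the likelihood contribution collapses to a log-determinant term, as in normalizing flows. A scalar sketch with a hypothetical affine transform (the transform and its parameters are assumptions for illustration, not from the SurVAE paper itself):

```python
import math

def affine_forward(x, a=2.0, b=1.0):
    # Deterministic invertible transform z = a*x + b. For a bijection, the
    # volume correction is just log|det J| = log|a|.
    z = a * x + b
    log_det = math.log(abs(a))
    return z, log_det

def log_prob_x(x, a=2.0, b=1.0):
    # Change of variables: log p(x) = log p_z(f(x)) + log|det df/dx|,
    # with a standard Gaussian base density p_z = N(0, 1).
    z, log_det = affine_forward(x, a, b)
    log_pz = -0.5 * (math.log(2 * math.pi) + z ** 2)
    return log_pz + log_det
```

Since $z = 2x + 1 \sim \mathcal{N}(0, 1)$ implies $x \sim \mathcal{N}(-0.5, 0.25)$, the change-of-variables result can be checked against the exact Gaussian density of $x$.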
Loss Function¶
We have the generic ELBO loss function, which calculates a loss between the joint variational distribution and the joint prior distribution:

$$\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) = \mathbb{E}_{q_\phi(z_{0:T})}\left[\log p_\theta(z_{0:T}, y_{1:T}) - \log q_\phi(z_{0:T})\right]$$

where $\theta$ are the prior parameters and $\phi$ are the variational parameters.
So, we can calculate gradients
$$\nabla_{\phi,\theta}\, \mathcal{L}_{\mathrm{ELBO}} = \nabla_{\phi,\theta}\, \mathbb{E}_{q_\phi(z_{0:T})}\left[\log p_\theta(z_{0:T}, y_{1:T}) - \log q_\phi(z_{0:T})\right]$$

The terms in this equation cannot be calculated in closed form.
So we must use some sort of Monte Carlo sampling routine
$$\nabla_{\phi,\theta}\, \mathcal{L}_{\mathrm{ELBO}} \approx \frac{1}{N}\sum_{n=1}^N \nabla_{\phi,\theta}\left[\log p_\theta(z_{0:T}^{(n)}, y_{1:T}) - \log q_\phi(z_{0:T}^{(n)})\right]$$

where $z_{0:T}^{(n)} \sim q_\phi(z_{0:T})$ are samples of the latent states drawn from the variational distribution.
There are some difficulties regarding calculating gradients over expectations.
See the pyro-ppl guide for more information about this.
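One standard workaround is the reparameterization trick: write $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, so the gradient can be moved inside the expectation. A sketch with a toy integrand $f(z) = z^2$ (chosen because the exact gradient of $\mathbb{E}_q[z^2] = \mu^2 + \sigma^2$ with respect to $\mu$ is $2\mu$):

```python
import random

def reparam_grad_mean(mu, sigma, n=2000, seed=0):
    # Reparameterization: z = mu + sigma * eps, eps ~ N(0, 1), so
    #   d/dmu E_q[f(z)] = E[f'(mu + sigma*eps)].
    # With f(z) = z^2 we have f'(z) = 2z, and the exact answer is 2*mu.
    rng = random.Random(seed)
    grads = [2.0 * (mu + sigma * rng.gauss(0.0, 1.0)) for _ in range(n)]
    return sum(grads) / n
```

With enough samples the estimate converges to $2\mu$; autodiff frameworks apply the same idea to full ELBO objectives.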
Variational Distributions¶
There are many ways one could construct the variational distribution:
- Independent
- Conditional
- Markovian
- Autoregressive
- Bi-Directional
Independent¶
This first case is the simplest.
We assume that the state does not depend upon anything.
An example formulation can be given by:
$$q(z_{1:T} \mid y_{1:T}) = \prod_{t=1}^T \mathcal{N}(z_t \mid m_\phi, S_\phi)$$
Conditional¶
This next case adds a dependence on the data.
We assume that the state only depends upon the observations, i.e., zt∼q(zt∣yt).
However, we allow for a non-linear relationship between the observations, yt, and the state, zt.
An example formulation can be given by:
$$q(z_{1:T} \mid y_{1:T}) = \prod_{t=1}^T \mathcal{N}(z_t \mid m(y_t; \phi), S(y_t; \phi))$$

This distribution captures the independent nature between the states, $q(z_t \mid z_{1:t-1}) = q(z_t)$.
Markovian¶
Another option is to do a linear transformation of the previous state and the current observation.
$$q(z_{0:T} \mid y_{1:T}) = \mathcal{N}(z_0 \mid \mu_0, \Sigma_0) \prod_{t=1}^T \mathcal{N}(z_t \mid m(y_t, z_{t-1}; \phi), S(y_t, z_{t-1}; \phi))$$

This distribution captures the Markovian nature between the states, $q(z_t \mid z_{1:t-1}) = q(z_t \mid z_{t-1})$.
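Sampling from this Markovian variational family is simply ancestral sampling. A sketch with a hypothetical linear mean function $m(y_t, z_{t-1}) = \phi_w y_t + \phi_u z_{t-1}$ and a fixed variance $s$ (all parameters are made up for illustration):

```python
import math
import random

def sample_markov_q(y, phi_w=0.5, phi_u=0.5, s=0.2, seed=0):
    # Ancestral sampling from
    #   q(z_{0:T}) = N(z_0 | 0, 1) * prod_t N(z_t | m(y_t, z_{t-1}), s)
    # with the hypothetical linear mean m(y_t, z_{t-1}) = phi_w*y_t + phi_u*z_{t-1}.
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0)]                  # z_0 from the initial Gaussian
    for t in range(1, len(y) + 1):
        mean = phi_w * y[t - 1] + phi_u * z[t - 1]
        z.append(rng.gauss(mean, math.sqrt(s)))
    return z
```

Each state is drawn conditioned only on the previous state and the current observation, matching the Markov factorization above.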
Autoregressive¶
$$q(z_{0:T} \mid y_{1:T}) = \mathcal{N}(z_0 \mid \mu_0, \Sigma_0) \prod_{t=1}^T \mathcal{N}(z_t \mid m(y_t, z_{1:t-1}; \phi), S(y_t, z_{1:t-1}; \phi))$$

This distribution captures the auto-regressive nature between the states: the conditional $q(z_t \mid z_{1:t-1})$ depends upon the full history of previous states.
Bi-Directional¶
$$q(z_{0:T} \mid y_{1:T}) = \mathcal{N}(z_0 \mid \mu_0, \Sigma_0) \prod_{t=1}^T \mathcal{N}(z_t \mid m(y_{1:T}, z_{1:t-1}; \phi), S(y_{1:T}, z_{1:t-1}; \phi))$$

This distribution captures the bi-directional nature of the states: each $z_t$ may depend upon the full sequence of observations, $y_{1:T}$, including future ones.
Latent Encoders¶
$$\mu_{h_t}, \Sigma_{h_t} = T(y_{1:T}; \phi)$$

Now, we can redo each of the above methods using this encoder structure.
Conditionally Independent Observations¶
$$\begin{aligned}
\text{Data Encoder:} &\quad \mu_{h_t}, \Sigma_{h_t} = T(y_{1:T}; \phi) \\
\text{Variational:} &\quad q(z_{1:T} \mid y_{1:T}) = \prod_{t=1}^T \mathcal{N}(z_t \mid \mu_{h_t}, \Sigma_{h_t})
\end{aligned}$$

This is referred to as the RNN Mean-Field encoder because the latent states are conditionally independent given the encoder outputs.
Markovian¶
$$\begin{aligned}
\text{Data Encoder:} &\quad \mu_{\theta_t}, \sigma_{\theta_t} = T(y_{1:T}; \phi) \\
\text{Variational:} &\quad q(z_{1:T} \mid y_{1:T}) = \prod_{t=1}^T \mathcal{N}(z_t \mid m(z_{t-1}; \mu_{\theta_t}), S(z_{t-1}; \sigma_{\theta_t}))
\end{aligned}$$

This acts as a type of hyper-network, whereby the weights of the variational distribution function are given by another neural network, the RNN.
Autoregressive¶
$$\begin{aligned}
\text{Data Encoder:} &\quad \mu_{\theta_t}, \sigma_{\theta_t} = T(y_{1:T}; \phi) \\
\text{Variational:} &\quad q(z_{1:T} \mid y_{1:T}) = \prod_{t=1}^T \mathcal{N}(z_t \mid m(z_{1:t-1}; \mu_{\theta_t}), S(z_{1:t-1}; \sigma_{\theta_t}))
\end{aligned}$$
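A minimal sketch of the encoder idea, assuming a one-unit recurrent network with made-up scalar weights, paired with the mean-field variant for simplicity (everything here is a hypothetical toy, not a reference implementation):

```python
import math
import random

def rnn_encoder(y, w_h=0.5, w_y=1.0, w_mu=1.0, w_s=0.5):
    # A minimal recurrent data encoder T(y_{1:T}; phi): a one-unit RNN whose
    # hidden state h_t summarizes y_{1:t}, with per-step Gaussian parameters
    # read off h_t (all weights are hypothetical scalars).
    h = 0.0
    params = []
    for y_t in y:
        h = math.tanh(w_h * h + w_y * y_t)       # recurrent hidden state
        mu_t = w_mu * h                          # variational mean
        var_t = math.exp(w_s * h)                # positive variance via exp
        params.append((mu_t, var_t))
    return params

def sample_mean_field(params, seed=0):
    # Mean-field variational sample: z_t ~ N(mu_t, var_t), independent per step.
    rng = random.Random(seed)
    return [rng.gauss(mu, math.sqrt(var)) for mu, var in params]
```

The same encoder outputs could instead parameterize the Markovian or autoregressive variational families above by feeding previous $z$ values into the mean and variance functions.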