
Sequential Variational Inference


Context

Let’s say we are given a sequence of measurements, $\boldsymbol{y}_n$.

$$\mathcal{D} = \left\{ \boldsymbol{y}_n \right\}_{n=1}^{N_t}$$

We assume that there is some latent state, $\boldsymbol{z}_t$, which enables the sequential measurements to be conditionally independent.


Joint Distribution

This shows how we factorize the joint distribution of the time series, using the conditional independence and Markov properties mentioned above.

$$p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T}) = p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_0\right) \prod_{t=1}^T p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right) p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right)$$

Posterior

We are interested in inferring the latent states, $\boldsymbol{z}_{0:T}$, given our observations, $\boldsymbol{y}_{1:T}$. Due to the Markovian nature of the state space model, this can be done recursively, one time step at a time. This is known as filtering.

$$p_{\boldsymbol{\theta}}(\boldsymbol{z}_t | \boldsymbol{y}_{1:t}) = \frac{1}{\boldsymbol{E}_{\boldsymbol{\theta}}}\, p_{\boldsymbol{\theta}}(\boldsymbol{y}_t|\boldsymbol{z}_t)\, p_{\boldsymbol{\theta}}(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1})$$

where the marginal likelihood, $\boldsymbol{E}_{\boldsymbol{\theta}}$, is given by

$$\boldsymbol{E}_{\boldsymbol{\theta}} = p_{\boldsymbol{\theta}}(\boldsymbol{y}_{t}|\boldsymbol{y}_{1:t-1}) = \int p_{\boldsymbol{\theta}}(\boldsymbol{y}_t|\boldsymbol{z}_t)\, p_{\boldsymbol{\theta}}(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1})\, d\boldsymbol{z}_t$$

This is typically computed with a filtering algorithm, which alternates between a prediction step and a correction step.

$$\begin{aligned} \text{Prediction}: && p_{\boldsymbol{\theta}}(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1}) &= \int p_{\boldsymbol{\theta}}(\boldsymbol{z}_t|\boldsymbol{z}_{t-1})\, p_{\boldsymbol{\theta}}(\boldsymbol{z}_{t-1}|\boldsymbol{y}_{1:t-1})\, d\boldsymbol{z}_{t-1} \\ \text{Correction}: && p_{\boldsymbol{\theta}}(\boldsymbol{z}_t|\boldsymbol{y}_{1:t}) &= \frac{1}{\boldsymbol{E}_{\boldsymbol{\theta}}}\, p_{\boldsymbol{\theta}}(\boldsymbol{y}_t|\boldsymbol{z}_t)\, p_{\boldsymbol{\theta}}(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1}) \end{aligned}$$
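
For a linear-Gaussian state space model, both integrals are available in closed form and the recursion becomes the Kalman filter. Below is a minimal sketch of one prediction/correction step; the transition matrix `A`, transition noise `Q`, observation matrix `H`, and observation noise `R` are assumed known and purely illustrative.

```python
import numpy as np

def kalman_step(m, P, y, A, Q, H, R):
    """One prediction/correction step for a linear-Gaussian SSM (sketch).

    p(z_t | y_{1:t-1}) = N(m_pred, P_pred) and p(z_t | y_{1:t}) = N(m_new, P_new).
    """
    # Prediction: push the previous filtering posterior through the transition model.
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Correction: condition the prediction on the new observation y_t.
    S = H @ P_pred @ H.T + R              # innovation covariance (the Gaussian E_theta)
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new
```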

Variational Inference

We will start with the full posterior written like so

$$p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t} | \boldsymbol{y}_{1:t}) = \frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})}{p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t})}$$

but we will rearrange this to isolate the marginal likelihood

$$p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t}) = \frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})} {p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t} | \boldsymbol{y}_{1:t})}$$

Now, we will do the standard log transformation on both sides

$$\log p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t}) = \log \frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})} {p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t} | \boldsymbol{y}_{1:t})}$$

Then we use the identity trick to introduce our variational distribution, $q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})$: since the left-hand side does not depend on $\boldsymbol{z}_{0:t}$, we can take the expectation with respect to $q_{\boldsymbol{\phi}}$ and multiply and divide by it inside the logarithm.

$$\log p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}\left[ \log \frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})} {p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t} | \boldsymbol{y}_{1:t})} \frac{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})} \right]$$

Now, we can break apart the log terms

$$\log p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}\left[ \log \frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})} {q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})} + \log \frac{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})} {p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t} | \boldsymbol{y}_{1:t})} \right]$$

and we can separate the expectation into two terms since the expectation is additive

$$\log p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}\left[ \log \frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})} {q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})} \right] + \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}\left[ \log \frac{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})} {p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t} | \boldsymbol{y}_{1:t})} \right]$$

The second term on the RHS is the KL divergence, which we can write in its more compact form.

$$\log p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}\left[ \log \frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})} {q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})} \right] + \text{D}_{\text{KL}} \left[ q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t}) \,||\, p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t} | \boldsymbol{y}_{1:t}) \right]$$

The term on the right is the variational gap. A KL divergence is always greater than or equal to 0, and the left-hand side does not depend on $\boldsymbol{\phi}$, so maximizing the first term with respect to $\boldsymbol{\phi}$ minimizes the second term, i.e., it minimizes the variational gap.

Thus, we can drop that term and obtain a lower bound on the marginal log-likelihood.

This gives us the evidence lower bound (ELBO).

$$\mathcal{L}_\text{ELBO} := \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}\left[ \log \frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})} {q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})} \right] \leq \log p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t})$$
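
To see the bound concretely, here is a tiny single-step numeric check with a conjugate Gaussian model (the numbers and distributions are assumptions chosen purely for illustration). The Monte Carlo ELBO estimate sits below the exact log-evidence, and adding the KL gap recovers it.

```python
import torch
import torch.distributions as dist

# Toy model: z ~ N(0, 1), y | z ~ N(z, 1)  =>  p(y) = N(0, sqrt(2)),
# exact posterior z | y ~ N(y / 2, sqrt(1/2)).
y = torch.tensor(1.3)
prior = dist.Normal(0.0, 1.0)
log_evidence = dist.Normal(0.0, torch.sqrt(torch.tensor(2.0))).log_prob(y)

q = dist.Normal(0.4, 0.9)                   # an arbitrary variational distribution
z = q.sample((100_000,))                    # Monte Carlo samples from q
elbo = (dist.Normal(z, 1.0).log_prob(y)     # log p(y | z)
        + prior.log_prob(z)                 # + log p(z)
        - q.log_prob(z)).mean()             # - log q(z)

gap = dist.kl_divergence(q, dist.Normal(y / 2, torch.sqrt(torch.tensor(0.5))))
print(float(elbo), float(log_evidence), float(elbo + gap))  # elbo <= log_evidence
```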

To clean this up, we set $t = T$ so that we work with the full sequence, and we split the term using the log rules

$$\mathcal{L}_\text{ELBO} := \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T})} \left[ \log p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T}) - \log q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T}) \right]$$

Now, we decompose the joint distribution using the state space factorization above and assume the variational distribution factorizes across time steps; for brevity we drop the initial-state terms, $\log p_{\boldsymbol{\theta}}(\boldsymbol{z}_0) - \log q_{\boldsymbol{\phi}}(\boldsymbol{z}_0)$.

$$\mathcal{L}_\text{ELBO} := \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T})} \left[ \sum_{t=1}^T\log p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right) + \sum_{t=1}^T\log p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right) - \sum_{t=1}^T \log q_{\boldsymbol{\phi}}(\boldsymbol{z}_t) \right]$$

We can pull the summations outside of the expectation, keeping only the marginals of $q_{\boldsymbol{\phi}}$ that each term actually depends on (e.g., the pair marginal $q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})$ for the transition term)

$$\mathcal{L}_\text{ELBO} := \sum_{t=1}^T \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})} \left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right) + \log p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right) - \log q_{\boldsymbol{\phi}}(\boldsymbol{z}_t) \right]$$

As with our other derivations of variational bounds, there are three different formulations depending upon how we group these terms.


Variational Free Energy (VFE)

This is one of the main formulations seen in the literature. Looking at the per-time-step ELBO above, we group the likelihood and the prior together under the variational expectation. This gives us:

$$\mathcal{L}_{\text{ELBO}} = {\color{red} \sum_{t=1}^T \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})} \left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right) p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right) \right]} - {\color{green} \sum_{t=1}^T \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t})} \left[ \log q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t})\right] }.$$

where:

  • ${\color{red}\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})}\left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right)\right]}$ - is the ${\color{red}\text{energy}}$ function
  • ${\color{green} \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_t)}\left[ \log q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t})\right]}$ - is the negative ${\color{green}\text{entropy}}$ of the variational distribution (so subtracting it adds the entropy)

Source: I see this approach a lot in the Gaussian process literature, e.g., when deriving the sparse Gaussian process bound of Titsias.
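
As a sketch of how the energy and entropy terms can be estimated in practice, here is a single-time-step Monte Carlo example. All distributions (and the factorized form of $q$) are assumptions chosen purely for illustration.

```python
import torch
import torch.distributions as dist

# Assumed factorized variational factors q(z_{t-1}) and q(z_t), plus unit-variance
# Gaussian transition and likelihood models (illustrative only).
q_prev, q_t = dist.Normal(0.0, 1.0), dist.Normal(0.5, 0.8)
y_t = torch.tensor(1.0)

z_prev, z_t = q_prev.sample((50_000,)), q_t.sample((50_000,))
energy = (dist.Normal(z_t, 1.0).log_prob(y_t)        # log p(y_t | z_t)
          + dist.Normal(z_prev, 1.0).log_prob(z_t)   # + log p(z_t | z_{t-1})
          ).mean()
neg_entropy = q_t.log_prob(z_t).mean()               # E_q[log q(z_t)]
elbo_t = energy - neg_entropy                        # per-time-step contribution
print(float(elbo_t))
```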


Reconstruction Loss

This is the most common form of the loss. Looking at the per-time-step ELBO above, if we group the prior and the variational distribution together, we get:

$$\mathcal{L}_\text{ELBO} := \sum_{t=1}^T \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_t)} \left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right) \right] + \sum_{t=1}^T \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})} \left[ \log \frac{p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right)} {q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t})} \right]$$

The second term is a KL divergence written with the ratio in the reverse order. So, with a sleight of hand on the signs, we can rearrange the term to be

$$\mathcal{L}_\text{ELBO} := \sum_{t=1}^T \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_t)} \left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right) \right] - \sum_{t=1}^T \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})} \left[ \log \frac{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t})} {p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right)} \right]$$

So now we have the same KL divergence form as before, and we can use the compact notation.

$$\mathcal{L}_{\text{ELBO}} = {\color{red} \sum_{t=1}^T \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_t)} \left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)\right]} - {\color{green} \sum_{t=1}^T \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1})}\left[\text{D}_\text{KL}\left[ q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t})\,||\,p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right) \right]\right]}.$$

where:

  • ${\color{red}\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_t)}\left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)\right]}$ - is the ${\color{red}\text{reconstruction loss}}$.
  • ${\color{green}\text{D}_\text{KL}\left[q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t})||p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right)\right]}$ - is the complexity, i.e. the ${\color{green}\text{KL divergence}}$ (a divergence, not a true distance metric) between the variational distribution and the prior.

This is easily the most common form of the ELBO, especially with Variational AutoEncoders (VAEs). The first term is the expectation of the log-likelihood with respect to the variational distribution. The second term is the KL divergence between the variational distribution and the prior.
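
Below is a minimal sketch of how this reconstruction + KL form is often computed per time step with reparameterized diagonal-Gaussian variational factors. The names (`sequential_elbo`, `lik_fn`, `trans_fn`) are hypothetical placeholders rather than any particular library's API.

```python
import torch
import torch.distributions as dist

def sequential_elbo(y, q_mu, q_sig, lik_fn, trans_fn):
    """Reconstruction-loss form of the sequential ELBO (sketch).

    Assumes diagonal-Gaussian q(z_t) = N(q_mu[t], q_sig[t]^2) and two model heads:
      lik_fn(z_t)     -> Normal over y_t    (p_theta(y_t | z_t))
      trans_fn(z_tm1) -> Normal over z_t    (p_theta(z_t | z_{t-1}))
    """
    elbo = torch.tensor(0.0)
    z_prev = torch.zeros_like(q_mu[0])               # stand-in for z_0
    for t in range(y.shape[0]):
        q_t = dist.Normal(q_mu[t], q_sig[t])
        z_t = q_t.rsample()                          # reparameterized sample
        elbo = elbo + lik_fn(z_t).log_prob(y[t]).sum()                  # reconstruction
        elbo = elbo - dist.kl_divergence(q_t, trans_fn(z_prev)).sum()   # complexity
        z_prev = z_t
    return elbo
```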


Volume Correction

Another approach is more along the lines of a transformed distribution. Assume we have our original data domain, $\mathcal{X}$, and some stochastic transformation, $p(\boldsymbol{z}|\boldsymbol{x})$, which maps the data from our original domain to a transformed domain, $\mathcal{Z}$.

$$\boldsymbol{z} \sim p(\boldsymbol{z}|\boldsymbol{x})$$

To obtain this from the per-time-step ELBO above, we isolate the prior term and combine the likelihood with the variational distribution.

$$\mathcal{L}_{\text{ELBO}} = {\color{red} \sum_{t=1}^T \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})} \left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right) \right]} + {\color{green} \sum_{t=1}^T \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_t)} \left[ \log \frac{p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)} {q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t})} \right]}.$$

where:

  • ${\color{red}\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})}\left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right) \right]}$ - is the expectation of the transformed (transition) distribution, aka the ${\color{red}\text{reparameterized probability}}$.
  • ${\color{green}\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_t)}\left[ \log \frac{p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)}{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t})} \right]}$ - is the ratio between the inverse transform and the forward transform, i.e. the ${\color{green}\text{Volume Correction Factor}}$ or likelihood contribution.

Source: I first saw this approach in the SurVAE Flows paper.
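
For completeness, here is the analogous single-time-step Monte Carlo sketch for this grouping, again with assumed Gaussian choices for every distribution; it computes the same per-step ELBO contribution as before, just grouped differently.

```python
import torch
import torch.distributions as dist

q_prev, q_t = dist.Normal(0.0, 1.0), dist.Normal(0.3, 0.7)   # assumed q(z_{t-1}), q(z_t)
y_t = torch.tensor(0.8)

z_prev, z_t = q_prev.sample((50_000,)), q_t.sample((50_000,))
reparam_prob = dist.Normal(z_prev, 1.0).log_prob(z_t).mean()   # E[log p(z_t | z_{t-1})]
volume_corr = (dist.Normal(z_t, 1.0).log_prob(y_t)             # E[log p(y_t | z_t)
               - q_t.log_prob(z_t)).mean()                     #    - log q(z_t)]
print(float(reparam_prob + volume_corr))                       # per-time-step ELBO
```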


Loss Function

The generic ELBO loss function measures the discrepancy between the joint variational distribution and the joint model distribution.

$$\mathcal{L}_\text{ELBO}(\boldsymbol{\theta},\boldsymbol{\phi}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T})} \left[ \log p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T}) - \log q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T}) \right]$$

where $\boldsymbol{\theta}$ are the model (prior and likelihood) parameters and $\boldsymbol{\phi}$ are the variational parameters. So, we can calculate gradients with respect to both

$$\boldsymbol{\nabla}_{\boldsymbol{\phi},\boldsymbol{\theta}}\mathcal{L}_\text{ELBO} = \boldsymbol{\nabla}_{\boldsymbol{\phi},\boldsymbol{\theta}} \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T})} \left[ \log p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T}) - \log q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T}) \right]$$

In general, these expectations cannot be calculated in closed form, so we must use some sort of Monte Carlo estimate

$$\boldsymbol{\nabla}_{\boldsymbol{\phi},\boldsymbol{\theta}}\mathcal{L}_\text{ELBO} \approx \frac{1}{N}\sum_{n=1}^{N} \boldsymbol{\nabla}_{\boldsymbol{\phi},\boldsymbol{\theta}} \left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_{0:T}^{(n)},\boldsymbol{y}_{1:T}\right) - \log q_{\boldsymbol{\phi}}\left(\boldsymbol{z}_{0:T}^{(n)}\right) \right]$$

where $\boldsymbol{z}^{(n)}_{0:T} \sim q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T})$ are samples of the latent states drawn from the variational distribution. There are some subtleties in calculating gradients of expectations that depend on $\boldsymbol{\phi}$ (e.g., reparameterization versus score-function estimators). See the pyro-ppl guide for more information about this.
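
Here is a minimal sketch of the reparameterization-trick estimator for these gradients on a single-step toy model (the standard-normal prior and unit-variance likelihood are assumptions for illustration). Using `rsample` lets gradients flow through the samples into the variational parameters.

```python
import torch
import torch.distributions as dist

phi_mu = torch.tensor(0.0, requires_grad=True)        # variational parameters phi
phi_log_sig = torch.tensor(0.0, requires_grad=True)
y = torch.tensor(2.0)

q = dist.Normal(phi_mu, phi_log_sig.exp())
z = q.rsample((1_000,))                                # reparameterized samples
elbo = (dist.Normal(z, 1.0).log_prob(y)                # log p(y | z)
        + dist.Normal(0.0, 1.0).log_prob(z)            # + log p(z)
        - q.log_prob(z)).mean()                        # - log q(z)

(-elbo).backward()                                     # minimize the negative ELBO
print(phi_mu.grad, phi_log_sig.grad)                   # Monte Carlo gradient estimates
```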


Variational Distributions

There are many ways one could parameterize the variational distribution. Some common choices are:

  • Independent
  • Markovian
  • Autoregressive
  • Bi-Directional

Independent

This first case is the simplest: we assume that each state does not depend on anything else. An example formulation is given by:

$$q(\boldsymbol{z}_{1:T}) = \prod_{t=1}^T \mathcal{N} \left(\boldsymbol{z}_t|\, \boldsymbol{m}_{\boldsymbol{\phi}}, \boldsymbol{S}_{\boldsymbol{\phi}} \right)$$

Conditional

This case is still quite simple: we assume that each state depends only on its corresponding observation, i.e., $\boldsymbol{z}_t \sim q(\boldsymbol{z}_t|\boldsymbol{y}_t)$. However, we allow for a non-linear relationship between the observation, $\boldsymbol{y}_t$, and the state, $\boldsymbol{z}_t$. An example formulation is given by:

$$q(\boldsymbol{z}_{1:T}|\boldsymbol{y}_{1:T}) = \prod_{t=1}^T \mathcal{N} \left(\boldsymbol{z}_t|\, \boldsymbol{m}(\boldsymbol{y}_t;\boldsymbol{\phi}), \boldsymbol{S}(\boldsymbol{y}_t;\boldsymbol{\phi}) \right)$$

This distribution captures the independence between the states: each $\boldsymbol{z}_t$ is independent of the other latent states given its own observation.
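
A minimal sketch of how this conditional family is usually amortized: a small network maps each observation to the mean and (diagonal) scale of $q(\boldsymbol{z}_t|\boldsymbol{y}_t)$. The architecture below is an assumption made purely for illustration.

```python
import torch
import torch.nn as nn

class ConditionalEncoder(nn.Module):
    """Amortized encoder for q(z_t | y_t) = N(m(y_t), diag S(y_t)) (sketch)."""
    def __init__(self, obs_dim, latent_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.Tanh())
        self.mean = nn.Linear(hidden_dim, latent_dim)
        self.log_scale = nn.Linear(hidden_dim, latent_dim)

    def forward(self, y_t):
        h = self.net(y_t)
        return self.mean(h), self.log_scale(h).exp()   # m(y_t; phi), S(y_t; phi)
```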


Markovian

Another option is to condition each state on both the previous state and the current observation.

$$q(\boldsymbol{z}_{0:T}|\boldsymbol{y}_{1:T}) = \mathcal{N}(\boldsymbol{z}_0|\boldsymbol{\mu}_0,\boldsymbol{\Sigma}_0) \prod_{t=1}^T \mathcal{N}\left(\boldsymbol{z}_t|\, \boldsymbol{m}(\boldsymbol{y}_t,\boldsymbol{z}_{t-1};\boldsymbol{\phi}), \boldsymbol{S}(\boldsymbol{y}_t,\boldsymbol{z}_{t-1};\boldsymbol{\phi}) \right)$$

This distribution captures the Markovian nature between the states: each $\boldsymbol{z}_t$ depends only on the previous state, $\boldsymbol{z}_{t-1}$.


Autoregressive

$$q(\boldsymbol{z}_{0:T}|\boldsymbol{y}_{1:T}) = \mathcal{N}(\boldsymbol{z}_0|\boldsymbol{\mu}_0,\boldsymbol{\Sigma}_0) \prod_{t=1}^T \mathcal{N}\left(\boldsymbol{z}_t|\, \boldsymbol{m}(\boldsymbol{y}_t,\boldsymbol{z}_{1:t-1};\boldsymbol{\phi}), \boldsymbol{S}(\boldsymbol{y}_t,\boldsymbol{z}_{1:t-1};\boldsymbol{\phi}) \right)$$

This distribution captures the auto-regressive nature between the states: each $\boldsymbol{z}_t$ depends on all previous states, $\boldsymbol{z}_{1:t-1}$.


Bi-Directional

$$q(\boldsymbol{z}_{0:T}|\boldsymbol{y}_{1:T}) = \mathcal{N}(\boldsymbol{z}_0|\boldsymbol{\mu}_0,\boldsymbol{\Sigma}_0) \prod_{t=1}^T \mathcal{N}\left(\boldsymbol{z}_t|\, \boldsymbol{m}(\boldsymbol{y}_t,\boldsymbol{z}_{1:T};\boldsymbol{\phi}), \boldsymbol{S}(\boldsymbol{y}_t,\boldsymbol{z}_{1:T};\boldsymbol{\phi}) \right)$$

This distribution captures the bi-directional nature between the states: each $\boldsymbol{z}_t$ depends on the entire sequence, both past and future.


Latent Encoders

We can also amortize the variational parameters with a data encoder, $\boldsymbol{T}$, which maps the observation sequence to per-time-step parameters:

$$\boldsymbol{\mu_h}_t, \boldsymbol{\Sigma_h}_t = \boldsymbol{T}(\boldsymbol{y}_{1:T};\boldsymbol{\phi})$$

Now, we can redo each of the above methods using this encoder structure.
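
A minimal sketch of such a data encoder, assuming a GRU backbone and diagonal covariances; setting `bidirectional=True` realizes the bi-directional family from the previous section.

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """Data encoder T(y_{1:T}; phi): per-step (mu_h_t, Sigma_h_t) from an RNN (sketch)."""
    def __init__(self, obs_dim, hidden_dim, latent_dim, bidirectional=False):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True,
                          bidirectional=bidirectional)
        out_dim = hidden_dim * (2 if bidirectional else 1)
        self.mean = nn.Linear(out_dim, latent_dim)
        self.log_scale = nn.Linear(out_dim, latent_dim)

    def forward(self, y):                      # y: (batch, T, obs_dim)
        h, _ = self.rnn(y)                     # h: (batch, T, out_dim)
        return self.mean(h), self.log_scale(h).exp()   # per-step mu_h_t, sigma_h_t
```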


Conditionally Independent Observations

$$\begin{aligned} \text{Data Encoder}: && \boldsymbol{\mu_h}_t, \boldsymbol{\Sigma_h}_t &= \boldsymbol{T}(\boldsymbol{y}_{1:T};\boldsymbol{\phi})\\ \text{Variational}: && q(\boldsymbol{z}_{1:T}|\boldsymbol{y}_{1:T}) &= \prod_{t=1}^T \mathcal{N} \left(\boldsymbol{z}_t|\boldsymbol{\mu_h}_t,\boldsymbol{\Sigma_h}_t\right) \end{aligned}$$

This is referred to as the RNN mean-field encoder because the variational factors are conditionally independent of one another given the encoder outputs.


Markovian

$$\begin{aligned} \text{Data Encoder}: && \boldsymbol{\mu_h}_t, \boldsymbol{\Sigma_h}_t &= \boldsymbol{T}(\boldsymbol{y}_{1:T};\boldsymbol{\phi})\\ \text{Variational}: && q(\boldsymbol{z}_{1:T}|\boldsymbol{y}_{1:T}) &= \prod_{t=1}^T \mathcal{N} \left(\boldsymbol{z}_t|\, \boldsymbol{m}(\boldsymbol{z}_{t-1};\boldsymbol{\mu_h}_t), \boldsymbol{S}(\boldsymbol{z}_{t-1};\boldsymbol{\Sigma_h}_t) \right) \end{aligned}$$

This acts as a type of hyper-network, whereby the parameters of the variational distribution function are produced by another neural network, the RNN.
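
A minimal sketch of this idea, loosely in the spirit of the deep Markov model's combiner network; every architectural detail here is an assumption for illustration.

```python
import torch
import torch.nn as nn

class Combiner(nn.Module):
    """Merge the RNN summary h_t of y_{1:T} with the previous latent z_{t-1}
    to produce the parameters of q(z_t | z_{t-1}, y_{1:T}) (sketch)."""
    def __init__(self, latent_dim, rnn_dim):
        super().__init__()
        self.lin_z = nn.Linear(latent_dim, rnn_dim)
        self.mean = nn.Linear(rnn_dim, latent_dim)
        self.log_scale = nn.Linear(rnn_dim, latent_dim)

    def forward(self, z_prev, h_t):
        # Combine the deterministic sequence summary with the stochastic previous state.
        h = 0.5 * (torch.tanh(self.lin_z(z_prev)) + h_t)
        return self.mean(h), self.log_scale(h).exp()
```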


Autoregressive

$$\begin{aligned} \text{Data Encoder}: && \boldsymbol{\mu_h}_t, \boldsymbol{\Sigma_h}_t &= \boldsymbol{T}(\boldsymbol{y}_{1:T};\boldsymbol{\phi})\\ \text{Variational}: && q(\boldsymbol{z}_{1:T}|\boldsymbol{y}_{1:T}) &= \prod_{t=1}^T \mathcal{N} \left(\boldsymbol{z}_t|\, \boldsymbol{m}(\boldsymbol{z}_{1:t-1};\boldsymbol{\mu_h}_t), \boldsymbol{S}(\boldsymbol{z}_{1:t-1};\boldsymbol{\Sigma_h}_t) \right) \end{aligned}$$