# Sequential Variational Inference
## Context

Let's say we are given a sequence of measurements, $\boldsymbol{y}_n$.
$$
\mathcal{D} = \left\{ \boldsymbol{y}_n \right\}_{n=1}^{N_t}
$$

We assume that there is some latent state, $\boldsymbol{z}_t$, which renders the sequential measurements conditionally independent.
## Joint Distribution

This represents how we decompose the time series, using the properties mentioned above.
$$
p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T}) =
p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_0\right)
\prod_{t=1}^T
p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)
p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right)
$$
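To make the factorization concrete, here is a minimal sketch (not taken from the original text) that evaluates $\log p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T})$ for a toy linear-Gaussian state space model; the transition and emission parameters (`A`, `C`, and the noise scales) are illustrative placeholders.

```python
import torch
from torch.distributions import Normal

T, dim_z, dim_y = 10, 2, 3
A = 0.9 * torch.eye(dim_z)      # transition matrix (assumed, toy example)
C = torch.randn(dim_y, dim_z)   # emission matrix (assumed, toy example)
q_scale, r_scale = 0.1, 0.5     # transition / emission noise scales

def joint_log_prob(z, y):
    """z: (T+1, dim_z) latent path, y: (T, dim_y) observations."""
    lp = Normal(torch.zeros(dim_z), 1.0).log_prob(z[0]).sum()       # log p(z_0)
    for t in range(1, T + 1):
        lp += Normal(z[t - 1] @ A.T, q_scale).log_prob(z[t]).sum()  # log p(z_t | z_{t-1})
        lp += Normal(z[t] @ C.T, r_scale).log_prob(y[t - 1]).sum()  # log p(y_t | z_t)
    return lp

z = torch.randn(T + 1, dim_z)
y = torch.randn(T, dim_y)
print(joint_log_prob(z, y))
```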
## Posterior

We are interested in finding the latent states, $\boldsymbol{z}_{0:T}$, given our observations, $\boldsymbol{y}_{1:T}$.
Due to the Markovian nature of the state space model, this posterior can be computed recursively, one time step at a time.
This is known as *filtering*.
$$
p_{\boldsymbol{\theta}}(\boldsymbol{z}_t | \boldsymbol{y}_{1:t}) =
\frac{1}{\boldsymbol{E}_{\boldsymbol{\theta}}}
p_{\boldsymbol{\theta}}(\boldsymbol{y}_t|\boldsymbol{z}_t)
p_{\boldsymbol{\theta}}(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1})
$$

where the marginal likelihood, $\boldsymbol{E}_{\boldsymbol{\theta}}$, is given by
$$
\boldsymbol{E}_{\boldsymbol{\theta}} =
p_{\boldsymbol{\theta}}(\boldsymbol{y}_{t}|\boldsymbol{y}_{1:t-1}) =
\int p_{\boldsymbol{\theta}}(\boldsymbol{y}_t|\boldsymbol{z}_t)
p_{\boldsymbol{\theta}}(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1})d\boldsymbol{z}_t
$$

This is typically computed with a filtering algorithm, which alternates a prediction step and a correction step.
$$
\begin{aligned}
\text{Prediction}: && &&
p_{\boldsymbol{\theta}}(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1}) &=
\int p_{\boldsymbol{\theta}}(\boldsymbol{z}_t|\boldsymbol{z}_{t-1})
p_{\boldsymbol{\theta}}(\boldsymbol{z}_{t-1}|\boldsymbol{y}_{1:t-1})d\boldsymbol{z}_{t-1} \\
\text{Correction}: && &&
p_{\boldsymbol{\theta}}(\boldsymbol{z}_t|\boldsymbol{y}_{1:t}) &=
\frac{1}{\boldsymbol{E}_{\boldsymbol{\theta}}}
p_{\boldsymbol{\theta}}(\boldsymbol{y}_t|\boldsymbol{z}_t)
p_{\boldsymbol{\theta}}(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1})
\end{aligned}
$$
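For a linear-Gaussian state space model, both integrals are available in closed form and the two steps reduce to the Kalman filter updates. Below is a minimal sketch under that assumption; the matrices `A`, `C`, `Q`, `R` and the observations are illustrative placeholders rather than anything defined in the text.

```python
import torch

dim_z, dim_y = 2, 2
A = 0.9 * torch.eye(dim_z)     # transition matrix
C = torch.eye(dim_y, dim_z)    # emission matrix
Q = 0.01 * torch.eye(dim_z)    # transition noise covariance
R = 0.25 * torch.eye(dim_y)    # emission noise covariance

def kalman_step(m, P, y):
    # Prediction: p(z_t | y_{1:t-1}) = N(m_pred, P_pred)
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Correction: p(z_t | y_{1:t}) is proportional to p(y_t | z_t) p(z_t | y_{1:t-1})
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ torch.linalg.inv(S)   # Kalman gain
    m_new = m_pred + K @ (y - C @ m_pred)
    P_new = (torch.eye(dim_z) - K @ C) @ P_pred
    return m_new, P_new

m, P = torch.zeros(dim_z), torch.eye(dim_z)  # p(z_0)
for y_t in torch.randn(10, dim_y):           # toy observations
    m, P = kalman_step(m, P, y_t)
```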
## Variational Inference

We will start with the full posterior written like so

$$
p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t} | \boldsymbol{y}_{1:t}) =
\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})}{p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t})}
$$

but we will rearrange this to isolate the marginal likelihood
$$
p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t}) =
\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})}
{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t} | \boldsymbol{y}_{1:t})}
$$

Now, we apply the standard log transformation to both sides
$$
\log p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t}) =
\log \frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})}
{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t} | \boldsymbol{y}_{1:t})}
$$

Then we use the identity trick to introduce our variational distribution, $q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})$.
$$
\log p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t}) =
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}\left[
\log
\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})}
{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t} | \boldsymbol{y}_{1:t})}
\frac{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}
\right]
$$

Now, we can break apart the log terms
$$
\log p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t}) =
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}\left[
\log
\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})}
{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})} +
\log
\frac{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}
{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t} | \boldsymbol{y}_{1:t})}
\right]
$$

and we can separate the expectation terms since they are additive
$$
\log p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t}) =
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}\left[
\log
\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})}
{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}
\right] +
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}\left[
\log \frac{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}
{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t} | \boldsymbol{y}_{1:t})}
\right]
$$

The second term on the RHS is a KL divergence, which we can replace with its more compact form.
$$
\log p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t}) =
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}\left[
\log
\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})}
{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}
\right] +
\text{D}_{\text{KL}}
\left[
q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t}) ||
p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t} | \boldsymbol{y}_{1:t})
\right]
$$

The term on the right is the variational gap.
We know that the KL divergence is always greater than or equal to zero.
Since the left-hand side does not depend on the variational parameters, maximizing the first term is equivalent to minimizing the second term, i.e., minimizing the variational gap.
Thus, we can drop that term and obtain a lower bound on the log marginal likelihood.
Later, we will also decompose the joint distribution within the first term.
$$
\boldsymbol{L}_\text{ELBO} :=
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}\left[
\log
\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:t},\boldsymbol{y}_{1:t})}
{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:t})}
\right]
\leq
\log p_{\boldsymbol{\theta}}(\boldsymbol{y}_{1:t})
$$

To clean this term up, we first split it using the log rules (writing it now for the full sequence up to $T$)
$$
\boldsymbol{L}_\text{ELBO} :=
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T})}
\left[
\log p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T}) -
\log q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T})
\right]
$$

Now, we will decompose the joint distribution based on our priors.
$$
\boldsymbol{L}_\text{ELBO} :=
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T})}
\left[
\sum_{t=1}^T\log
p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)
+
\sum_{t=1}^T\log
p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right)
-
\sum_{t=1}^T
\log q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1})
\right]
$$

We can push the summations outside of the logs and expectations
$$
\boldsymbol{L}_\text{ELBO} :=
\sum_{t=1}^T
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})}
\left[
\log
p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)
+
\log
p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right)
-
\log q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1})
\right]
$$

As in our other derivations with variational distributions, this bound can be grouped into three different forms depending upon how we break the terms apart.
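As a concrete illustration, here is a minimal sketch (my own toy setup, not from the original text) that evaluates this decomposed ELBO with a single reparameterized Monte Carlo sample and a mean-field Gaussian variational distribution; the model pieces (`A`, `C`, noise scales) and the data are placeholders.

```python
import torch
from torch.distributions import Normal

T, dim_z, dim_y = 10, 2, 3
A, C = 0.9 * torch.eye(dim_z), torch.randn(dim_y, dim_z)   # toy model parameters
q_scale, r_scale = 0.1, 0.5
y = torch.randn(T, dim_y)                                  # toy observations

# variational parameters: one Gaussian per latent state z_0, ..., z_T
m = torch.zeros(T + 1, dim_z, requires_grad=True)
log_s = torch.zeros(T + 1, dim_z, requires_grad=True)

q = Normal(m, log_s.exp())
z = q.rsample()                                            # reparameterized samples z_{0:T}

elbo = 0.0
for t in range(1, T + 1):
    elbo += Normal(z[t] @ C.T, r_scale).log_prob(y[t - 1]).sum()     # E_q[log p(y_t | z_t)]
    elbo += Normal(z[t - 1] @ A.T, q_scale).log_prob(z[t]).sum()     # E_q[log p(z_t | z_{t-1})]
    elbo -= q.log_prob(z)[t - 1].sum()                               # E_q[log q(z_{t-1})]
```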
## Variational Free Energy (VFE)

This is one of the main decompositions seen in the literature. Looking at the decomposed ELBO above, we isolate the likelihood and the prior under the variational expectation. This gives us:
$$
\mathcal{L}_{\text{ELBO}}=
{\color{red}
\sum_{t=1}^T
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})}
\left[ \log
p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)
p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right)
\right]} -
{\color{green}
\sum_{t=1}^T
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1})}
\left[ \log q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1})\right]
}
$$

where:
- ${\color{red}\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})}\left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right)\right]}$ is the ${\color{red}\text{energy}}$ function
- ${\color{green} \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1})}\left[ \log q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1})\right]}$ is the negative ${\color{green}\text{entropy}}$ of the variational distribution, which enters the bound with a minus sign
**Source**: I see this approach a lot in the Gaussian process literature, for example when deriving the sparse Gaussian process of Titsias.
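Here is a minimal sketch of the energy/entropy split, reusing the same toy linear-Gaussian pieces as above (all names are illustrative). For a Gaussian $q$, the entropy term is available in closed form.

```python
import torch
from torch.distributions import Normal

T, dim_z, dim_y = 10, 2, 3
A, C = 0.9 * torch.eye(dim_z), torch.randn(dim_y, dim_z)
q_scale, r_scale = 0.1, 0.5
y = torch.randn(T, dim_y)

m, log_s = torch.zeros(T + 1, dim_z), torch.zeros(T + 1, dim_z)
q = Normal(m, log_s.exp())
z = q.rsample()

energy, entropy = 0.0, 0.0
for t in range(1, T + 1):
    # energy: E_q[log p(y_t | z_t) p(z_t | z_{t-1})]
    energy += Normal(z[t] @ C.T, r_scale).log_prob(y[t - 1]).sum()
    energy += Normal(z[t - 1] @ A.T, q_scale).log_prob(z[t]).sum()
    # entropy of q(z_{t-1}); closed form for a Gaussian
    entropy += q.entropy()[t - 1].sum()

elbo = energy + entropy   # energy - E_q[log q] = energy + entropy
```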
## Reconstruction Loss

This is the most common form of the loss. Looking at the decomposed ELBO above, if we group the prior probability and the variational distribution together, we get:
$$
\boldsymbol{L}_\text{ELBO} :=
\sum_{t=1}^T
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_t)}
\left[
\log
p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)
\right]
+
\sum_{t=1}^T
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})}
\left[
\log
\frac{p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right)}
{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1})}
\right]
$$

This contains the same kind of KL term as before, but with the ratio in the reverse order. So, with a little sleight of hand on the signs, we can rearrange the term to be
$$
\boldsymbol{L}_\text{ELBO} :=
\sum_{t=1}^T
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_t)}
\left[
\log
p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)
\right]
-
\sum_{t=1}^T
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})}
\left[
\log
\frac{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t})}
{p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right)}
\right]
$$

So now we have the same form of KL term as before (variational distribution over the prior), and we can use the compact notation.
$$
\boldsymbol{L}_{\text{ELBO}}=
{\color{red}
\sum_{t=1}^T
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_t)}
\left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)\right]} -
{\color{green}
\sum_{t=1}^T
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1})}
\left[
\text{D}_\text{KL}\left[
q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t})||p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right)
\right]
\right]}
$$

where:
- ${\color{red}\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_t)}\left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)\right]}$ is the ${\color{red}\text{reconstruction loss}}$.
- ${\color{green}\text{D}_\text{KL}\left[q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t})||p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right)\right]}$ is the complexity, i.e. the ${\color{green}\text{KL divergence}}$ (a measure of discrepancy, though not a true distance metric) between the variational distribution and the prior.
This is easily the most common form of the ELBO, especially with Variational AutoEncoders (VAEs). The first term is the expectation of the log-likelihood term with respect to the variational distribution. The second term is the KL divergence between the variational distribution and the prior.
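Here is a minimal sketch of this reconstruction + KL form, again with the toy linear-Gaussian setup (illustrative names only). The KL between two Gaussians is available in closed form via `torch.distributions.kl_divergence`.

```python
import torch
from torch.distributions import Normal, kl_divergence

T, dim_z, dim_y = 10, 2, 3
A, C = 0.9 * torch.eye(dim_z), torch.randn(dim_y, dim_z)
q_scale, r_scale = 0.1, 0.5
y = torch.randn(T, dim_y)

m, log_s = torch.zeros(T + 1, dim_z), torch.zeros(T + 1, dim_z)
q = Normal(m, log_s.exp())
z = q.rsample()                                            # samples of z_{0:T}

recon, kl = 0.0, 0.0
for t in range(1, T + 1):
    # reconstruction: E_q[log p(y_t | z_t)]
    recon += Normal(z[t] @ C.T, r_scale).log_prob(y[t - 1]).sum()
    # complexity: D_KL[q(z_t) || p(z_t | z_{t-1})], with z_{t-1} a sample from q
    prior_t = Normal(z[t - 1] @ A.T, q_scale)
    q_t = Normal(m[t], log_s[t].exp())
    kl += kl_divergence(q_t, prior_t).sum()

elbo = recon - kl
```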
## Volume Correction

Another approach is more along the lines of a transformed distribution. Assume we have our original data domain, $\mathcal{X}$, and some stochastic transformation, $p(z|x)$, which maps the data from the original domain to a transformed domain, $\mathcal{Z}$.
$$
z \sim p(z|x)
$$

To obtain this form from the decomposed ELBO above, we isolate the prior and combine the likelihood and the variational distribution.
$$
\boldsymbol{L}_{\text{ELBO}}=
{\color{red}
\sum_{t=1}^T
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})}
\left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right) \right]} +
{\color{green}
\sum_{t=1}^T
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_t)}
\left[ \log
\frac{p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)}
{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t})} \right]}
$$

where:
- ${\color{red}\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t-1:t})}\left[ \log p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1}\right) \right]}$ is the expectation of the transformed distribution, aka the ${\color{red}\text{reparameterized probability}}$.
- ${\color{green}\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_t)}\left[ \log \frac{p_{\boldsymbol{\theta}}\left(\boldsymbol{y}_t|\boldsymbol{z}_t\right)}{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t})} \right]}$ is the ratio between the inverse transform and the forward transform, i.e. the ${\color{green}\text{Volume Correction Factor}}$ or *likelihood contribution*.
**Source**: I first saw this approach in the SurVAE Flows paper.
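A minimal sketch of this split, with the same toy placeholder setup as before: the prior term is evaluated at reparameterized samples, and the second term is the log-ratio of the likelihood and the variational density.

```python
import torch
from torch.distributions import Normal

T, dim_z, dim_y = 10, 2, 3
A, C = 0.9 * torch.eye(dim_z), torch.randn(dim_y, dim_z)
q_scale, r_scale = 0.1, 0.5
y = torch.randn(T, dim_y)

m, log_s = torch.zeros(T + 1, dim_z), torch.zeros(T + 1, dim_z)
q = Normal(m, log_s.exp())
z = q.rsample()

prior_term, ratio_term = 0.0, 0.0
for t in range(1, T + 1):
    # E_q[log p(z_t | z_{t-1})] -- the "reparameterized probability"
    prior_term += Normal(z[t - 1] @ A.T, q_scale).log_prob(z[t]).sum()
    # E_q[log p(y_t | z_t) / q(z_t)] -- the volume correction / likelihood contribution
    ratio_term += Normal(z[t] @ C.T, r_scale).log_prob(y[t - 1]).sum()
    ratio_term -= q.log_prob(z)[t].sum()

elbo = prior_term + ratio_term
```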
## Loss Function

The generic ELBO loss function computes a loss between the joint variational distribution and the joint prior distribution.
$$
\boldsymbol{L}_\text{ELBO}(\boldsymbol{\theta},\boldsymbol{\phi}) =
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T})}
\left[
\log p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T}) -
\log q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T})
\right]
$$

where $\boldsymbol{\theta}$ are the model (prior) parameters and $\boldsymbol{\phi}$ are the variational parameters.
So, we can calculate gradients
$$
\boldsymbol{\nabla}_{\boldsymbol{\phi},\boldsymbol{\theta}}\boldsymbol{L}_\text{ELBO} =
\boldsymbol{\nabla}_{\boldsymbol{\phi},\boldsymbol{\theta}}
\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T})}
\left[
\log p_{\boldsymbol{\theta}}(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T}) -
\log q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T})
\right]
$$

The terms in this equation cannot be calculated in closed form.
So we must use some sort of Monte Carlo sampling routine
$$
\boldsymbol{\nabla}_{\boldsymbol{\phi},\boldsymbol{\theta}}\boldsymbol{L}_\text{ELBO}
\approx
\frac{1}{N}\sum_{n=1}^{N}
\boldsymbol{\nabla}_{\boldsymbol{\phi},\boldsymbol{\theta}}
\left[
\log p_{\boldsymbol{\theta}}\left(\boldsymbol{z}_{0:T}^{(n)},\boldsymbol{y}_{1:T}\right) -
\log q_{\boldsymbol{\phi}}\left(\boldsymbol{z}_{0:T}^{(n)}\right)
\right]
$$

where $\boldsymbol{z}^{(n)}_{0:T}$ are samples drawn from the variational distribution, $q_{\boldsymbol{\phi}}(\boldsymbol{z}_{0:T})$.
There are some subtleties in computing gradients of expectations with respect to the variational parameters; in practice one typically uses the reparameterization trick or a score-function estimator.
See the pyro-ppl guide for more information about this.
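Here is a minimal sketch (my own toy setup, not from the original text) of a single reparameterized Monte Carlo gradient step on the ELBO, using the same placeholder linear-Gaussian model as in the earlier sketches.

```python
import torch
from torch.distributions import Normal

T, dim_z, dim_y, N = 10, 2, 3, 4                        # N Monte Carlo samples
A, C = 0.9 * torch.eye(dim_z), torch.randn(dim_y, dim_z)
q_scale, r_scale = 0.1, 0.5
y = torch.randn(T, dim_y)

m = torch.zeros(T + 1, dim_z, requires_grad=True)       # variational parameters (phi)
log_s = torch.zeros(T + 1, dim_z, requires_grad=True)
opt = torch.optim.Adam([m, log_s], lr=1e-2)

def log_joint(z):
    lp = Normal(torch.zeros(dim_z), 1.0).log_prob(z[0]).sum()
    for t in range(1, T + 1):
        lp += Normal(z[t - 1] @ A.T, q_scale).log_prob(z[t]).sum()
        lp += Normal(z[t] @ C.T, r_scale).log_prob(y[t - 1]).sum()
    return lp

q = Normal(m, log_s.exp())
z = q.rsample((N,))                                     # (N, T+1, dim_z), reparameterized
elbo = torch.stack([log_joint(z[n]) - q.log_prob(z[n]).sum() for n in range(N)]).mean()
(-elbo).backward()                                      # minimize the negative ELBO
opt.step()
```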
## Variational Distributions

There are many ways one could parameterize the variational distribution. Common choices include:

- Independent
- Conditional
- Markovian
- Autoregressive
- Bi-Directional
### Independent

This first case is the simplest.
We assume that the state does not depend upon anything.
An example formulation can be given by:
$$
q(\boldsymbol{z}_{1:T}|\boldsymbol{y}_{1:T}) =
\prod_{t=1}^T
\mathcal{N}
\left(\boldsymbol{z}_t|
\boldsymbol{m}_{\boldsymbol{\phi}},
\boldsymbol{S}_{\boldsymbol{\phi}}
\right)
$$

### Conditional

This case is nearly as simple.
We assume that the state only depends upon the corresponding observation, i.e., $\boldsymbol{z}_t \sim q(\boldsymbol{z}_t|\boldsymbol{y}_t)$.
However, we allow for a non-linear relationship between the observation, $\boldsymbol{y}_t$, and the state, $\boldsymbol{z}_t$.
An example formulation can be given by:
$$
q(\boldsymbol{z}_{1:T}|\boldsymbol{y}_{1:T}) =
\prod_{t=1}^T
\mathcal{N}
\left(\boldsymbol{z}_t|
\boldsymbol{m}(\boldsymbol{y}_t;\boldsymbol{\phi}),
\boldsymbol{S}(\boldsymbol{y}_t;\boldsymbol{\phi})
\right)
$$

This distribution captures the independence between the states, $q(\boldsymbol{z}_t|\boldsymbol{z}_{1:t-1}) = q(\boldsymbol{z}_t)$.
### Markovian

Another option is to transform the previous state together with the current observation.

$$
q(\boldsymbol{z}_{0:T}|\boldsymbol{y}_{1:T}) =
\mathcal{N}(\boldsymbol{z}_0|\boldsymbol{\mu}_0,\boldsymbol{\Sigma})
\prod_{t=1}^T
\mathcal{N}\left(\boldsymbol{z}_t|
\boldsymbol{m}(\boldsymbol{y}_t,\boldsymbol{z}_{t-1};\boldsymbol{\phi}),
\boldsymbol{S}(\boldsymbol{y}_t,\boldsymbol{z}_{t-1};\boldsymbol{\phi})
\right)
$$

This distribution captures the Markovian nature between the states, $q(\boldsymbol{z}_t|\boldsymbol{z}_{1:t-1}) = q(\boldsymbol{z}_t|\boldsymbol{z}_{t-1})$.
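Below is a minimal sketch of such a Markovian variational distribution: the mean and scale of each $q(\boldsymbol{z}_t|\cdot)$ come from a small network applied to $(\boldsymbol{y}_t, \boldsymbol{z}_{t-1})$. The network architecture and sizes are illustrative placeholders.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

dim_z, dim_y, T = 2, 3, 10

# small network producing (mean, raw scale) from (y_t, z_{t-1})
net = nn.Sequential(nn.Linear(dim_y + dim_z, 32), nn.Tanh(), nn.Linear(32, 2 * dim_z))

def sample_q(y):
    """Draw z_{0:T} ~ q and return the samples and their log-density."""
    z_prev = torch.zeros(dim_z)                        # z_0 (fixed here for brevity)
    zs, log_q = [z_prev], 0.0
    for t in range(T):
        params = net(torch.cat([y[t], z_prev]))
        mean = params[:dim_z]
        scale = nn.functional.softplus(params[dim_z:]) + 1e-4
        q_t = Normal(mean, scale)
        z_t = q_t.rsample()                            # reparameterized sample
        log_q += q_t.log_prob(z_t).sum()
        zs.append(z_t)
        z_prev = z_t
    return torch.stack(zs), log_q

z, log_q = sample_q(torch.randn(T, dim_y))
```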
### Autoregressive

$$
q(\boldsymbol{z}_{0:T}|\boldsymbol{y}_{1:T}) =
\mathcal{N}(\boldsymbol{z}_0|\boldsymbol{\mu}_0,\boldsymbol{\Sigma})
\prod_{t=1}^T
\mathcal{N}\left(\boldsymbol{z}_t|
\boldsymbol{m}(\boldsymbol{y}_t,\boldsymbol{z}_{1:t-1};\boldsymbol{\phi}),
\boldsymbol{S}(\boldsymbol{y}_t,\boldsymbol{z}_{1:t-1};\boldsymbol{\phi})
\right)
$$

This distribution captures the autoregressive nature between the states, where each state depends on the full history of previous states, $q(\boldsymbol{z}_t|\boldsymbol{z}_{1:t-1})$.
### Bi-Directional

$$
q(\boldsymbol{z}_{0:T}|\boldsymbol{y}_{1:T}) =
\mathcal{N}(\boldsymbol{z}_0|\boldsymbol{\mu}_0,\boldsymbol{\Sigma})
\prod_{t=1}^T
\mathcal{N}\left(\boldsymbol{z}_t|
\boldsymbol{m}(\boldsymbol{y}_t,\boldsymbol{z}_{1:T};\boldsymbol{\phi}),
\boldsymbol{S}(\boldsymbol{y}_t,\boldsymbol{z}_{1:T};\boldsymbol{\phi})
\right)
$$

This distribution captures the bi-directional nature between the states, where each state depends on the entire sequence of states, $q(\boldsymbol{z}_t|\boldsymbol{z}_{1:T})$.
## Latent Encoders

We can also use a sequence encoder, e.g., a recurrent neural network, that maps the observations to per-time-step parameters

$$
\boldsymbol{\mu_h}_t, \boldsymbol{\Sigma_h}_t = \boldsymbol{T}(\boldsymbol{y}_{1:T};\boldsymbol{\phi})
$$

Now, we can redo each of the above methods using this encoder structure.
### Conditionally Independent Observations

$$
\begin{aligned}
\text{Data Encoder}: && &&
\boldsymbol{\mu_h}_t, \boldsymbol{\Sigma_h}_t &= \boldsymbol{T}(\boldsymbol{y}_{1:T};\boldsymbol{\phi})\\
\text{Variational}: && &&
q(\boldsymbol{z}_{1:T}|\boldsymbol{y}_{1:T}) &=
\prod_{t=1}^T
\mathcal{N}
\left(\boldsymbol{z}_t|\boldsymbol{\mu_h}_t,\boldsymbol{\Sigma_h}_t\right)
\end{aligned}
$$

This is referred to as the RNN mean-field encoder because, given the encoder outputs, the latent states are mutually independent.
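Here is a minimal sketch of this idea (illustrative sizes and names): a GRU consumes $\boldsymbol{y}_{1:T}$ and a linear head emits per-time-step Gaussian parameters for a mean-field $q$.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

dim_z, dim_y, dim_h, T = 2, 3, 16, 10

rnn = nn.GRU(input_size=dim_y, hidden_size=dim_h, batch_first=True)
head = nn.Linear(dim_h, 2 * dim_z)                 # -> (mu_t, raw scale_t) per step

y = torch.randn(1, T, dim_y)                       # (batch, T, dim_y)
h, _ = rnn(y)                                      # (1, T, dim_h) hidden states
params = head(h)
mu = params[..., :dim_z]
scale = nn.functional.softplus(params[..., dim_z:]) + 1e-4

q = Normal(mu, scale)                              # factorized over time (mean-field)
z = q.rsample()                                    # (1, T, dim_z)
```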
### Markovian

$$
\begin{aligned}
\text{Data Encoder}: && &&
\boldsymbol{\mu_\theta}_t, \boldsymbol{\sigma_\theta}_t &= \boldsymbol{T}(\boldsymbol{y}_{1:T};\boldsymbol{\phi})\\
\text{Variational}: && &&
q(\boldsymbol{z}_{1:T}|\boldsymbol{y}_{1:T}) &=
\prod_{t=1}^T
\mathcal{N}
\left(\boldsymbol{z}_t|
\boldsymbol{m}(\boldsymbol{z}_{t-1};\boldsymbol{\mu_\theta}_t),
\boldsymbol{S}(\boldsymbol{z}_{t-1};\boldsymbol{\sigma_\theta}_t)
\right)
\end{aligned}
$$

This acts as a type of hyper-network, whereby the parameters of the variational distribution's mean and covariance functions are produced by another neural network, the RNN.
### Autoregressive

$$
\begin{aligned}
\text{Data Encoder}: && &&
\boldsymbol{\mu_\theta}_t, \boldsymbol{\sigma_\theta}_t &= \boldsymbol{T}(\boldsymbol{y}_{1:T};\boldsymbol{\phi})\\
\text{Variational}: && &&
q(\boldsymbol{z}_{1:T}|\boldsymbol{y}_{1:T}) &=
\prod_{t=1}^T
\mathcal{N}
\left(\boldsymbol{z}_t|
\boldsymbol{m}(\boldsymbol{z}_{1:t-1};\boldsymbol{\mu_\theta}_t),
\boldsymbol{S}(\boldsymbol{z}_{1:t-1};\boldsymbol{\sigma_\theta}_t)
\right)
\end{aligned}
$$