Context¶
Let’s say we are given a sequence of measurements, $y_n$:

$$\mathcal{D} = \{y_n\}_{n=1}^N$$

We assume that there is some latent state, $z_t$, which enables the sequential measurements to be conditionally independent.
Joint Distribution¶
This represents how we decompose the time series.
We use the properties mentioned above.
$$p_\theta(z_{0:T}, y_{1:T}) = p_\theta(z_0) \prod_{t=1}^T p_\theta(y_t \mid z_t)\, p_\theta(z_t \mid z_{t-1})$$
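As a concrete sketch, assume a scalar linear-Gaussian model, $z_t \sim \mathcal{N}(a z_{t-1}, q)$ and $y_t \sim \mathcal{N}(z_t, r)$, with made-up parameters (this model and its parameters are illustrative assumptions, not part of the derivation). The log of the joint factorization above can then be evaluated term by term:

```python
import math

def log_normal(x, mean, var):
    # Log-density of a univariate Gaussian N(x | mean, var).
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_joint(z, y, a=0.9, q_var=1.0, r_var=0.5):
    # log p(z_{0:T}, y_{1:T}) = log p(z_0) + sum_t [log p(z_t|z_{t-1}) + log p(y_t|z_t)]
    # for the hypothetical scalar model z_t ~ N(a*z_{t-1}, q_var), y_t ~ N(z_t, r_var).
    lp = log_normal(z[0], 0.0, 1.0)                    # prior p(z_0) = N(0, 1)
    for t in range(1, len(z)):
        lp += log_normal(z[t], a * z[t - 1], q_var)    # transition p(z_t | z_{t-1})
        lp += log_normal(y[t - 1], z[t], r_var)        # emission  p(y_t | z_t)
    return lp
```

The function simply accumulates the prior, transition, and emission log-densities from the product above.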
Posterior¶
We are interested in finding the latent states, z0:T, given our observations, y1:T.
However, due to the Markovian nature of the state space model, this process can be carried out recursively, as a combination of a prediction step and a correction step.
This is known as filtering.
$$p_\theta(z_t \mid y_{1:t}) = \frac{1}{E_\theta}\, p_\theta(y_t \mid z_t)\, p_\theta(z_t \mid y_{1:t-1})$$

where the marginal likelihood, $E_\theta$, is given by
$$E_\theta = p_\theta(y_t \mid y_{1:t-1}) = \int p_\theta(y_t \mid z_t)\, p_\theta(z_t \mid y_{1:t-1})\, dz_t$$

This is typically computed by the filtering algorithm, which has a prediction and a correction step.
$$\begin{aligned}
\text{Prediction:} &\quad p_\theta(z_t \mid y_{1:t-1}) = \int p_\theta(z_t \mid z_{t-1})\, p_\theta(z_{t-1} \mid y_{1:t-1})\, dz_{t-1} \\
\text{Correction:} &\quad p_\theta(z_t \mid y_{1:t}) = \frac{1}{E_\theta}\, p_\theta(y_t \mid z_t)\, p_\theta(z_t \mid y_{1:t-1})
\end{aligned}$$
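For the linear-Gaussian special case, both steps are available in closed form (the Kalman filter). A minimal scalar sketch, again assuming the hypothetical model $z_t \sim \mathcal{N}(a z_{t-1}, q)$, $y_t \sim \mathcal{N}(z_t, r)$:

```python
import math

def kalman_step(mu, var, y, a=0.9, q_var=1.0, r_var=0.5):
    # One filtering step for the hypothetical scalar linear-Gaussian SSM.
    # Prediction: p(z_t | y_{1:t-1}) = N(a*mu, a^2*var + q_var)
    mu_pred = a * mu
    var_pred = a * a * var + q_var
    # Correction: condition the Gaussian prediction on the new observation y_t.
    k = var_pred / (var_pred + r_var)          # Kalman gain
    mu_filt = mu_pred + k * (y - mu_pred)
    var_filt = (1.0 - k) * var_pred
    # Marginal likelihood contribution: log E_theta = log p(y_t | y_{1:t-1}).
    log_evidence = -0.5 * (math.log(2 * math.pi * (var_pred + r_var))
                           + (y - mu_pred) ** 2 / (var_pred + r_var))
    return mu_filt, var_filt, log_evidence
```

Running this over a sequence of observations and summing `log_evidence` yields the exact log marginal likelihood for this special case.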
Variational Inference¶
We will start with the full posterior written like so

$$p_\theta(z_{0:t} \mid y_{1:t}) = \frac{p_\theta(z_{0:t}, y_{1:t})}{p_\theta(y_{1:t})}$$

but will rearrange this to have the marginal likelihood isolated

$$p_\theta(y_{1:t}) = \frac{p_\theta(z_{0:t}, y_{1:t})}{p_\theta(z_{0:t} \mid y_{1:t})}$$

Now, we will do the standard log transformation on both sides

$$\log p_\theta(y_{1:t}) = \log \frac{p_\theta(z_{0:t}, y_{1:t})}{p_\theta(z_{0:t} \mid y_{1:t})}$$

Then we will do the identity trick to push in our variational distribution, $q_\phi(z_{0:t})$. Because the left-hand side does not depend on $z_{0:t}$, we can take the expectation with respect to $q_\phi$ without changing it:

$$\log p_\theta(y_{1:t}) = \mathbb{E}_{q_\phi(z_{0:t})}\left[\log \frac{p_\theta(z_{0:t}, y_{1:t})\, q_\phi(z_{0:t})}{p_\theta(z_{0:t} \mid y_{1:t})\, q_\phi(z_{0:t})}\right]$$

Now, we can break apart the log terms

$$\log p_\theta(y_{1:t}) = \mathbb{E}_{q_\phi(z_{0:t})}\left[\log \frac{p_\theta(z_{0:t}, y_{1:t})}{q_\phi(z_{0:t})} + \log \frac{q_\phi(z_{0:t})}{p_\theta(z_{0:t} \mid y_{1:t})}\right]$$

and we can separate the expectation terms as they are additive

$$\log p_\theta(y_{1:t}) = \mathbb{E}_{q_\phi(z_{0:t})}\left[\log \frac{p_\theta(z_{0:t}, y_{1:t})}{q_\phi(z_{0:t})}\right] + \mathbb{E}_{q_\phi(z_{0:t})}\left[\log \frac{q_\phi(z_{0:t})}{p_\theta(z_{0:t} \mid y_{1:t})}\right]$$

The 2nd term on the RHS is the KLD term, which we can replace with the more compact form

$$\log p_\theta(y_{1:t}) = \mathbb{E}_{q_\phi(z_{0:t})}\left[\log \frac{p_\theta(z_{0:t}, y_{1:t})}{q_\phi(z_{0:t})}\right] + D_{\mathrm{KL}}\left[q_\phi(z_{0:t})\,\|\,p_\theta(z_{0:t} \mid y_{1:t})\right]$$

The term on the right is the variational gap.
We know that this will always be greater than or equal to 0.
So we need to maximize the first term in order to minimize the second term, i.e., minimize the variational gap.
Thus, we can drop that term and put a lower bound on the likelihood.
This gives us a lower bound on the log marginal likelihood, known as the evidence lower bound (ELBO):

$$\mathcal{L}_{\mathrm{ELBO}} := \mathbb{E}_{q_\phi(z_{0:T})}\left[\log \frac{p_\theta(z_{0:T}, y_{1:T})}{q_\phi(z_{0:T})}\right] \leq \log p_\theta(y_{1:T})$$

To clean this term up, first we will split the term using the log rules
$$\mathcal{L}_{\mathrm{ELBO}} := \mathbb{E}_{q_\phi(z_{0:T})}\left[\log p_\theta(z_{0:T}, y_{1:T}) - \log q_\phi(z_{0:T})\right]$$

Now, we will decompose the joint distribution based on our priors.
$$\mathcal{L}_{\mathrm{ELBO}} := \mathbb{E}_{q_\phi(z_{0:T})}\left[\sum_{t=1}^T \log p_\theta(y_t \mid z_t) + \sum_{t=1}^T \log p_\theta(z_t \mid z_{t-1}) - \sum_{t=1}^T \log q_\phi(z_t)\right]$$

We can push the summations outside of the logs and expectations
$$\mathcal{L}_{\mathrm{ELBO}} := \sum_{t=1}^T \mathbb{E}_{q_\phi(z_{t-1:t})}\left[\log p_\theta(y_t \mid z_t) + \log p_\theta(z_t \mid z_{t-1}) - \log q_\phi(z_t)\right]$$

Similar to our other derivations of the variational distribution, we will also have 3 different terms depending upon how we break this apart.
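To make the per-time-step decomposition concrete, here is a naive Monte Carlo estimator of these terms, assuming the scalar linear-Gaussian model from before and an independent Gaussian $q_\phi(z_t)$ per step (model, variational family, and all parameters are illustrative assumptions):

```python
import math
import random

def log_normal(x, mean, var):
    # Log-density of a univariate Gaussian N(x | mean, var).
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def elbo_estimate(y, q_means, q_vars, a=0.9, q_var=1.0, r_var=0.5,
                  n_samples=200, seed=0):
    # Monte Carlo estimate of
    #   sum_t E_q[ log p(y_t|z_t) + log p(z_t|z_{t-1}) - log q(z_t) ]
    # (plus the z_0 prior/entropy term), with independent Gaussians
    # q(z_t) = N(q_means[t], q_vars[t]) for t = 0..T.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = [rng.gauss(m, math.sqrt(v)) for m, v in zip(q_means, q_vars)]
        lp = log_normal(z[0], 0.0, 1.0) - log_normal(z[0], q_means[0], q_vars[0])
        for t in range(1, len(z)):
            lp += log_normal(y[t - 1], z[t], r_var)        # reconstruction
            lp += log_normal(z[t], a * z[t - 1], q_var)    # transition prior
            lp -= log_normal(z[t], q_means[t], q_vars[t])  # entropy term
        total += lp
    return total / n_samples
```

Each Monte Carlo sample draws a full trajectory $z_{0:T}$ from $q_\phi$ and evaluates the three log terms along it.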
Variational Free Energy (VFE)¶
There is one more main derivation that remains (that’s often seen in the literature). Looking at the equation (16) again we will isolate the likelihood and the prior under the variational expectation. This gives us:
$$\mathcal{L}_{\mathrm{ELBO}} = \sum_{t=1}^T \mathbb{E}_{q_\phi(z_{t-1:t})}\left[\log p_\theta(y_t \mid z_t)\, p_\theta(z_t \mid z_{t-1})\right] - \sum_{t=1}^T \mathbb{E}_{q_\phi(z_t)}\left[\log q_\phi(z_t)\right]$$

where:

- $\mathbb{E}_{q_\phi(z_{t-1:t})}\left[\log p_\theta(y_t \mid z_t)\, p_\theta(z_t \mid z_{t-1})\right]$ - is the energy function
- $-\mathbb{E}_{q_\phi(z_t)}\left[\log q_\phi(z_t)\right]$ - is the entropy, $\mathbb{H}[q_\phi(z_t)]$
Source: I see this approach a lot in the Gaussian process literature, e.g., when deriving the Sparse Gaussian Process of Titsias.
Reconstruction Loss¶
This is the most common loss. Looking at equation (16) again, if we group the prior probability and the variational distribution together, we get:
$$\mathcal{L}_{\mathrm{ELBO}} := \sum_{t=1}^T \mathbb{E}_{q_\phi(z_t)}\left[\log p_\theta(y_t \mid z_t)\right] + \sum_{t=1}^T \mathbb{E}_{q_\phi(z_{t-1:t})}\left[\log \frac{p_\theta(z_t \mid z_{t-1})}{q_\phi(z_t)}\right]$$

This is the same KLD term as before but in the reverse order. So with a sleight of hand in terms of the signs, we can rearrange the term to be
$$\mathcal{L}_{\mathrm{ELBO}} := \sum_{t=1}^T \mathbb{E}_{q_\phi(z_t)}\left[\log p_\theta(y_t \mid z_t)\right] - \sum_{t=1}^T \mathbb{E}_{q_\phi(z_{t-1:t})}\left[\log \frac{q_\phi(z_t)}{p_\theta(z_t \mid z_{t-1})}\right]$$

So now, we have the exact same KLD term as before. So let’s use the simplified form.
$$\mathcal{L}_{\mathrm{ELBO}} = \sum_{t=1}^T \mathbb{E}_{q_\phi(z_t)}\left[\log p_\theta(y_t \mid z_t)\right] - \sum_{t=1}^T \mathbb{E}_{q_\phi(z_{t-1})}\left[D_{\mathrm{KL}}\left[q_\phi(z_t)\,\|\,p_\theta(z_t \mid z_{t-1})\right]\right]$$

where:

- $\mathbb{E}_{q_\phi(z_t)}\left[\log p_\theta(y_t \mid z_t)\right]$ - is the reconstruction loss.
- $D_{\mathrm{KL}}\left[q_\phi(z_t)\,\|\,p_\theta(z_t \mid z_{t-1})\right]$ - is the complexity, i.e. the KL divergence (a measure of discrepancy, though not a true distance metric) between the variational distribution and the prior.
This is easily the most common form of the ELBO, especially with Variational AutoEncoders (VAEs). The first term is the expectation of the likelihood term wrt the variational distribution. The second term is the KLD between the variational distribution and the prior.
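When both the variational distribution and the transition prior are Gaussian, the complexity term has a closed form, so no sampling is needed for it. A scalar sketch (the function name and scalar parameterization are my own):

```python
import math

def kl_gauss(m_q, v_q, m_p, v_p):
    # Closed-form KL[ N(m_q, v_q) || N(m_p, v_p) ] for scalar Gaussians:
    #   0.5 * ( log(v_p/v_q) + (v_q + (m_q - m_p)^2) / v_p - 1 )
    return 0.5 * (math.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)
```

For example, `kl_gauss(0.0, 1.0, 0.0, 1.0)` returns `0.0`, since the KL divergence between identical distributions is zero.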
Volume Correction¶
Another approach is more along the lines of a transformed distribution. Assume we have our original data domain, $\mathcal{X}$, and some stochastic transformation, $p(z \mid x)$, which maps the data from our original domain to a transformed domain, $\mathcal{Z}$.
$$z \sim p(z \mid x)$$

To acquire this from equation (16), we will isolate the prior and combine the likelihood and the variational distribution.
$$\mathcal{L}_{\mathrm{ELBO}} = \sum_{t=1}^T \mathbb{E}_{q_\phi(z_{t-1:t})}\left[\log p_\theta(z_t \mid z_{t-1})\right] + \sum_{t=1}^T \mathbb{E}_{q_\phi(z_t)}\left[\log \frac{p_\theta(y_t \mid z_t)}{q_\phi(z_t)}\right]$$

where:

- $\mathbb{E}_{q_\phi(z_{t-1:t})}\left[\log p_\theta(z_t \mid z_{t-1})\right]$ - is the expectation of the transformed distribution, aka the reparameterized probability.
- $\mathbb{E}_{q_\phi(z_t)}\left[\log \frac{p_\theta(y_t \mid z_t)}{q_\phi(z_t)}\right]$ - is the ratio between the inverse transform and the forward transform, i.e. the Volume Correction Factor or likelihood contribution.
Source: I first saw this approach in the SurVAE Flows paper.
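In the special case where the transform is a deterministic bijection, the likelihood contribution collapses to a log-determinant term, as in normalizing flows. A scalar sketch with a hypothetical affine transform (the transform and its parameters are assumptions for illustration, not from the SurVAE paper itself):

```python
import math

def affine_forward(x, a=2.0, b=1.0):
    # Deterministic invertible transform z = a*x + b. For a bijection, the
    # volume correction is just log|det J| = log|a|.
    z = a * x + b
    log_det = math.log(abs(a))
    return z, log_det

def log_prob_x(x, a=2.0, b=1.0):
    # Change of variables: log p(x) = log p_z(f(x)) + log|det df/dx|,
    # with a standard Gaussian base density p_z = N(0, 1).
    z, log_det = affine_forward(x, a, b)
    log_pz = -0.5 * (math.log(2 * math.pi) + z ** 2)
    return log_pz + log_det
```

Since $z = 2x + 1 \sim \mathcal{N}(0, 1)$ implies $x \sim \mathcal{N}(-0.5, 0.25)$, the change-of-variables result can be checked against the exact Gaussian density of $x$.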
Loss Function¶
We have the generic ELBO loss function, which calculates a loss between the joint variational distribution and the joint prior distribution:

$$\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) = \mathbb{E}_{q_\phi(z_{0:T})}\left[\log p_\theta(z_{0:T}, y_{1:T}) - \log q_\phi(z_{0:T})\right]$$

where $\theta$ are the prior parameters and $\phi$ are the variational parameters.
So, we can calculate gradients
$$\nabla_{\phi,\theta}\, \mathcal{L}_{\mathrm{ELBO}} = \nabla_{\phi,\theta}\, \mathbb{E}_{q_\phi(z_{0:T})}\left[\log p_\theta(z_{0:T}, y_{1:T}) - \log q_\phi(z_{0:T})\right]$$

The terms in this equation cannot be calculated in closed form.
So we must use some sort of Monte Carlo sampling routine
$$\nabla_{\phi,\theta}\, \mathcal{L}_{\mathrm{ELBO}} \approx \frac{1}{N}\sum_{n=1}^N \nabla_{\phi,\theta}\left[\log p_\theta(z_{0:T}^{(n)}, y_{1:T}) - \log q_\phi(z_{0:T}^{(n)})\right]$$

where $z_{0:T}^{(n)} \sim q_\phi(z_{0:T})$ are samples of the latent states drawn from the variational distribution.
There are some difficulties regarding calculating gradients over expectations.
See the pyro-ppl guide for more information about this.
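One standard workaround is the reparameterization trick: write $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, so the gradient can be moved inside the expectation. A sketch with a toy integrand $f(z) = z^2$ (chosen because the exact gradient of $\mathbb{E}_q[z^2] = \mu^2 + \sigma^2$ with respect to $\mu$ is $2\mu$):

```python
import random

def reparam_grad_mean(mu, sigma, n=2000, seed=0):
    # Reparameterization: z = mu + sigma * eps, eps ~ N(0, 1), so
    #   d/dmu E_q[f(z)] = E[f'(mu + sigma*eps)].
    # With f(z) = z^2 we have f'(z) = 2z, and the exact answer is 2*mu.
    rng = random.Random(seed)
    grads = [2.0 * (mu + sigma * rng.gauss(0.0, 1.0)) for _ in range(n)]
    return sum(grads) / n
```

With enough samples the estimate converges to $2\mu$; autodiff frameworks apply the same idea to full ELBO objectives.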
Variational Distributions¶
There are many ways one could construct the variational distribution:
- Independent
- Conditional
- Markovian
- Autoregressive
- Bi-Directional
Independent¶
This first case is the simplest.
We assume that the state does not depend upon anything.
An example formulation can be given by:
$$q(z_{1:T} \mid y_{1:T}) = \prod_{t=1}^T \mathcal{N}(z_t \mid m_\phi, S_\phi)$$
Conditional¶
This next case adds a dependence on the data.
We assume that the state only depends upon the observations, i.e., zt∼q(zt∣yt).
However, we allow for a non-linear relationship between the observations, yt, and the state, zt.
An example formulation can be given by:
$$q(z_{1:T} \mid y_{1:T}) = \prod_{t=1}^T \mathcal{N}(z_t \mid m(y_t; \phi), S(y_t; \phi))$$

This distribution captures the independent nature between the states, $q(z_t \mid z_{1:t-1}) = q(z_t)$.
Markovian¶
Another option is to do a linear transformation of the previous state and the current observation.
$$q(z_{0:T} \mid y_{1:T}) = \mathcal{N}(z_0 \mid \mu_0, \Sigma_0) \prod_{t=1}^T \mathcal{N}(z_t \mid m(y_t, z_{t-1}; \phi), S(y_t, z_{t-1}; \phi))$$

This distribution captures the Markovian nature between the states, $q(z_t \mid z_{1:t-1}) = q(z_t \mid z_{t-1})$.
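Sampling from this Markovian variational family is simply ancestral sampling. A sketch with a hypothetical linear mean function $m(y_t, z_{t-1}) = \phi_w y_t + \phi_u z_{t-1}$ and a fixed variance $s$ (all parameters are made up for illustration):

```python
import math
import random

def sample_markov_q(y, phi_w=0.5, phi_u=0.5, s=0.2, seed=0):
    # Ancestral sampling from
    #   q(z_{0:T}) = N(z_0 | 0, 1) * prod_t N(z_t | m(y_t, z_{t-1}), s)
    # with the hypothetical linear mean m(y_t, z_{t-1}) = phi_w*y_t + phi_u*z_{t-1}.
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0)]                  # z_0 from the initial Gaussian
    for t in range(1, len(y) + 1):
        mean = phi_w * y[t - 1] + phi_u * z[t - 1]
        z.append(rng.gauss(mean, math.sqrt(s)))
    return z
```

Each state is drawn conditioned only on the previous state and the current observation, matching the Markov factorization above.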
Autoregressive¶
$$q(z_{0:T} \mid y_{1:T}) = \mathcal{N}(z_0 \mid \mu_0, \Sigma_0) \prod_{t=1}^T \mathcal{N}(z_t \mid m(y_t, z_{1:t-1}; \phi), S(y_t, z_{1:t-1}; \phi))$$

This distribution captures the auto-regressive nature between the states: the conditional $q(z_t \mid z_{1:t-1})$ depends upon the full history of previous states.
Bi-Directional¶
$$q(z_{0:T} \mid y_{1:T}) = \mathcal{N}(z_0 \mid \mu_0, \Sigma_0) \prod_{t=1}^T \mathcal{N}(z_t \mid m(y_{1:T}, z_{1:t-1}; \phi), S(y_{1:T}, z_{1:t-1}; \phi))$$

This distribution captures the bi-directional nature of the states: each $z_t$ may depend upon the full sequence of observations, $y_{1:T}$, including future ones.
Latent Encoders¶
$$\mu_{h_t}, \Sigma_{h_t} = T(y_{1:T}; \phi)$$

Now, we can redo each of the above methods using this encoder structure.
Conditionally Independent Observations¶
$$\begin{aligned}
\text{Data Encoder:} &\quad \mu_{h_t}, \Sigma_{h_t} = T(y_{1:T}; \phi) \\
\text{Variational:} &\quad q(z_{1:T} \mid y_{1:T}) = \prod_{t=1}^T \mathcal{N}(z_t \mid \mu_{h_t}, \Sigma_{h_t})
\end{aligned}$$

This is referred to as the RNN Mean-Field encoder because the latent states are conditionally independent given the encoder outputs.
Markovian¶
$$\begin{aligned}
\text{Data Encoder:} &\quad \mu_{\theta_t}, \sigma_{\theta_t} = T(y_{1:T}; \phi) \\
\text{Variational:} &\quad q(z_{1:T} \mid y_{1:T}) = \prod_{t=1}^T \mathcal{N}(z_t \mid m(z_{t-1}; \mu_{\theta_t}), S(z_{t-1}; \sigma_{\theta_t}))
\end{aligned}$$

This acts as a type of hyper-network, whereby the weights of the variational distribution function are given by another neural network, the RNN.
Autoregressive¶
$$\begin{aligned}
\text{Data Encoder:} &\quad \mu_{\theta_t}, \sigma_{\theta_t} = T(y_{1:T}; \phi) \\
\text{Variational:} &\quad q(z_{1:T} \mid y_{1:T}) = \prod_{t=1}^T \mathcal{N}(z_t \mid m(z_{1:t-1}; \mu_{\theta_t}), S(z_{1:t-1}; \sigma_{\theta_t}))
\end{aligned}$$
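A minimal sketch of the encoder idea, assuming a one-unit recurrent network with made-up scalar weights, paired with the mean-field variant for simplicity (everything here is a hypothetical toy, not a reference implementation):

```python
import math
import random

def rnn_encoder(y, w_h=0.5, w_y=1.0, w_mu=1.0, w_s=0.5):
    # A minimal recurrent data encoder T(y_{1:T}; phi): a one-unit RNN whose
    # hidden state h_t summarizes y_{1:t}, with per-step Gaussian parameters
    # read off h_t (all weights are hypothetical scalars).
    h = 0.0
    params = []
    for y_t in y:
        h = math.tanh(w_h * h + w_y * y_t)       # recurrent hidden state
        mu_t = w_mu * h                          # variational mean
        var_t = math.exp(w_s * h)                # positive variance via exp
        params.append((mu_t, var_t))
    return params

def sample_mean_field(params, seed=0):
    # Mean-field variational sample: z_t ~ N(mu_t, var_t), independent per step.
    rng = random.Random(seed)
    return [rng.gauss(mu, math.sqrt(var)) for mu, var in params]
```

The same encoder outputs could instead parameterize the Markovian or autoregressive variational families above by feeding previous $z$ values into the mean and variance functions.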