
Linear Gaussian Markov Models


Gauss-Markov Models

In the previous section, we looked at Markov models and showed how we can assume a conditional, local dependence in time. This simplified the graphical model of dependencies, which allowed for very fast inference. An additional simplification is to place a Gaussian assumption on the probability distributions and to limit the functions to linear functions with additive Gaussian noise. This yields analytical forms for all of the quantities of interest in inference, i.e. the Kalman filter and RTS smoothing algorithms. In this section, we will go over these assumptions and showcase the equations that result from this simplification.


TLDR

We can assume linear dynamics to describe the transition and emission models:

$$
\begin{aligned}
\mathbf{z}_{t} &= \mathbf{A}\mathbf{z}_{t-1} + \boldsymbol{\delta} \\
\mathbf{x}_t &= \mathbf{Hz}_t + \boldsymbol{\epsilon}
\end{aligned}
$$

We can also assume that the prior, transition and emission distributions are all Gaussian distributed:

$$
\begin{aligned}
p(\mathbf{z}_0) &= \mathcal{N}(\mathbf{z}_0;\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0) \\
p(\mathbf{z}_{t}|\mathbf{z}_{t-1}) &= \mathcal{N}(\mathbf{z}_{t};\mathbf{Az}_{t-1}, \mathbf{Q}) \\
p(\mathbf{x}_t|\mathbf{z}_t) &= \mathcal{N}(\mathbf{x}_t;\mathbf{Hz}_t, \mathbf{R})
\end{aligned}
$$

We are interested in the posterior over the latent state given the measurements. This can be calculated with the filtering, $p(\mathbf{z}_t|\mathbf{x}_{1:t})$, and smoothing, $p(\mathbf{z}_t|\mathbf{x}_{1:T})$, operations. Because of the nature of Gaussians, we have analytical forms for all of these equations. The filtering operation can be solved exactly using the Kalman Filter (KF) and the smoothing operation can be solved exactly using the Rauch-Tung-Striebel (RTS) smoother. Both operations can be computed in linear time, $\mathcal{O}(T)$.

Kalman Filter

Recall that the filtering operation consists of alternating predict and update steps through all of the time steps. Each of these steps has an analytical form; together they are known as the Kalman filter algorithm. The equations are outlined below.

Predict Step

First, we need the predict step.

$$
p(\mathbf{z}_t|\mathbf{x}_{1:t-1}) = \mathcal{N}(\mathbf{z}_t; \boldsymbol{\mu}_{t|t-1},\boldsymbol{\Sigma}_{t|t-1})
$$

This equation is analytical but it is quite involved. Each of the terms is given below.

Predictive Mean

$$
\boldsymbol{\mu}_{t|t-1} = \mathbf{A}\boldsymbol{\mu}_{t-1}
$$

Predictive Covariance

$$
\boldsymbol{\Sigma}_{t|t-1} = \mathbf{A}\boldsymbol{\Sigma}_{t-1}\mathbf{A}^\top + \mathbf{Q}
$$
Update Step
$$
p(\mathbf{z}_t|\mathbf{x}_{1:t}) = \mathcal{N}(\mathbf{z}_t; \boldsymbol{\mu}_t, \boldsymbol{\Sigma}_t)
$$

Innovation Residual

$$
\boldsymbol{r}_t = \mathbf{x}_t - \mathbf{H}\boldsymbol{\mu}_{t|t-1}
$$

Innovation Covariance

$$
\mathbf{S}_t = \mathbf{H}\boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top + \mathbf{R}
$$

Kalman Gain

$$
\mathbf{K}_t = \boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top \mathbf{S}_t^{-1}
$$

Estimation Mean

$$
\boldsymbol{\mu}_t = \boldsymbol{\mu}_{t|t-1} + \mathbf{K}_t\boldsymbol{r}_t
$$

Estimation Covariance

$$
\boldsymbol{\Sigma}_t = (\mathbf{I} - \mathbf{K}_t\mathbf{H})\boldsymbol{\Sigma}_{t|t-1}
$$
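To make the recursion concrete, here is a minimal NumPy sketch of one predict-update cycle and the full forward pass, written directly from the naive equations above; the function and variable names are illustrative, not from any particular library.

```python
import numpy as np

def kf_predict(mu, Sigma, A, Q):
    """Predict step: propagate the state estimate through the linear dynamics."""
    mu_pred = A @ mu                            # predictive mean
    Sigma_pred = A @ Sigma @ A.T + Q            # predictive covariance
    return mu_pred, Sigma_pred

def kf_update(mu_pred, Sigma_pred, x, H, R):
    """Update step: correct the prediction with the measurement x."""
    r = x - H @ mu_pred                         # innovation residual
    S = H @ Sigma_pred @ H.T + R                # innovation covariance
    K = Sigma_pred @ H.T @ np.linalg.inv(S)     # Kalman gain (naive inverse)
    mu = mu_pred + K @ r                        # estimation mean
    Sigma = (np.eye(len(mu)) - K @ H) @ Sigma_pred  # estimation covariance
    return mu, Sigma

def kalman_filter(xs, mu0, Sigma0, A, Q, H, R):
    """Run the filter over a sequence of measurements xs with shape [T, Dx]."""
    mu, Sigma = mu0, Sigma0
    means, covs = [], []
    for x in xs:
        mu_pred, Sigma_pred = kf_predict(mu, Sigma, A, Q)
        mu, Sigma = kf_update(mu_pred, Sigma_pred, x, H, R)
        means.append(mu)
        covs.append(Sigma)
    return np.stack(means), np.stack(covs)
```

In practice the explicit inverse of $\mathbf{S}_t$ would be replaced by a Cholesky solve, which is one of the numerical refinements mentioned in the note below.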

Kalman Smoothing

This is the smoothing operation, which estimates the state at time $t$ given all of the measurements, $1:T$. This is given by:

$$
p(\mathbf{z}_t|\mathbf{x}_{1:T}) = \mathcal{N}(\mathbf{z}_t; \boldsymbol{\mu}_{t|T}, \boldsymbol{\Sigma}_{t|T})
$$

The terms within this equation are outlined below:

RTS Gain

$$
\mathbf{G}_t = \boldsymbol{\Sigma}_t \mathbf{A}^\top(\boldsymbol{\Sigma}_{t+1|t})^{-1}
$$

Smoothed Mean

$$
\boldsymbol{\mu}_{t|T} = \boldsymbol{\mu}_t + \mathbf{G}_t(\boldsymbol{\mu}_{t+1|T} - \boldsymbol{\mu}_{t+1|t})
$$

Smoothed Covariance

$$
\boldsymbol{\Sigma}_{t|T} = \boldsymbol{\Sigma}_t + \mathbf{G}_t (\boldsymbol{\Sigma}_{t+1|T} - \boldsymbol{\Sigma}_{t+1|t})\mathbf{G}_t^\top
$$
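Given the filtered means and covariances from the forward pass above, the backward pass can be sketched in the same spirit; this is again a naive, illustrative NumPy implementation rather than a reference one.

```python
import numpy as np

def rts_smoother(means, covs, A, Q):
    """Backward pass over filtered means/covs (arrays of shape [T, Dz] / [T, Dz, Dz])."""
    T = len(means)
    means_s = means.copy()   # smoothed means; the last step equals the filtered estimate
    covs_s = covs.copy()     # smoothed covariances
    for t in range(T - 2, -1, -1):
        # One-step-ahead prediction from the filtered estimate at time t.
        mu_pred = A @ means[t]
        Sigma_pred = A @ covs[t] @ A.T + Q
        # RTS gain.
        G = covs[t] @ A.T @ np.linalg.inv(Sigma_pred)
        # Smoothed mean and covariance.
        means_s[t] = means[t] + G @ (means_s[t + 1] - mu_pred)
        covs_s[t] = covs[t] + G @ (covs_s[t + 1] - Sigma_pred) @ G.T
    return means_s, covs_s
```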

Note: these equations are very involved. In addition, they are the naive forms; there are many matrix reformulations and manipulations that improve numerical stability or speed. So while it can be instructive to code them from scratch once, in practice it is worth using well-tested implementations.


Setting

Recall from the previous section that we were interested in Markov models due to their simplification of high-dimensional, highly correlated time series data. We invoke a few Markov properties, like local memory and conditional independence of measurements, to get a very simplified graphical model. We also showcased all of the resulting functions that can be computed for the quantities of interest, such as filtering, smoothing and posterior predictions. However, we did not mention anything about the functional form of these distributions. In principle, they can be almost anything.

Recall the joint distribution of interest:

$$
p(\mathbf{z}_{0:T}, \mathbf{x}_{1:T}) = \underbrace{p(\mathbf{z}_0)}_{\text{Prior}}\prod_{t=1}^T\underbrace{p(\mathbf{z}_t|\mathbf{z}_{t-1})}_{\text{Transition}}\prod_{t=1}^T \underbrace{p(\mathbf{x}_t|\mathbf{z}_t)}_{\text{Emission}}
$$

We see we have 3 terms that we need to define:

  • the prior specifies the initial condition of the latent variable, $\mathbf{z}$
  • the transition model specifies the distribution of the latent variable between time steps, $p(\mathbf{z}_t|\mathbf{z}_{t-1})$
  • the emission model specifies the likelihood function of the measurement, $\mathbf{x}$, given the state vector, $\mathbf{z}$.

We are going to put Gaussian assumptions on all of our distributions (i.e. prior, transition and emission).


Prior

We assume a Gaussian distribution for the prior of our state.

$$
p(\mathbf{z}_0) = \mathcal{N}(\mathbf{z}_0|\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)
$$

where $\boldsymbol{\mu}_0 \in \mathbb{R}^{D_z}$ is the mean and $\boldsymbol{\Sigma}_0\in \mathbb{R}^{D_z \times D_z}$ is the covariance that parameterize the prior Gaussian distribution. This is a rather uninformative prior, but this ends up not mattering too much, as the filtering and smoothing solutions tend to dominate after just a few time steps.

Transition Distribution

We also assume the transition function is a linear function with some additive Gaussian noise on the state.

$$
\mathbf{z}_{t} = \mathbf{A}\mathbf{z}_{t-1} + \boldsymbol{\delta}
$$

where $\mathbf{A} \in \mathbb{R}^{D_z \times D_z}$ and $\boldsymbol{\delta} \sim \mathcal{N}(\mathbf{0}, \mathbf{Q})$. We can explicitly write the distribution for the transition model like so:

$$
p(\mathbf{z}_{t}|\mathbf{z}_{t-1}) = \mathcal{N}(\mathbf{z}_{t}|\mathbf{Az}_{t-1}, \mathbf{Q})
$$

Note: the linear transformation gives us extra flexibility without sacrificing the ease of manipulating Gaussian distributions. An arbitrary non-linear function would give us problems later during the inference steps.

Emission Distribution

We also assume the emission function is a linear function with some additive Gaussian noise on the measurements.

$$
\mathbf{x}_t = \mathbf{Hz}_t + \boldsymbol{\epsilon}
$$

where $\mathbf{H} \in \mathbb{R}^{D_x \times D_z}$ is a matrix and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$ is the additive Gaussian noise. Again, like the transition function, we can write the emission distribution as a Gaussian distribution like so:

$$
p(\mathbf{x}_t|\mathbf{z}_t) = \mathcal{N}(\mathbf{x}_t|\mathbf{Hz}_t, \mathbf{R})
$$
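With all three distributions defined, a trajectory can be simulated by ancestral sampling: draw $\mathbf{z}_0$ from the prior, then alternate transition and emission draws. A minimal NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def sample_lgssm(A, Q, H, R, mu0, Sigma0, T, rng=None):
    """Draw one latent trajectory z_1..z_T and its measurements x_1..x_T."""
    rng = np.random.default_rng() if rng is None else rng
    Dz, Dx = A.shape[0], H.shape[0]
    z = rng.multivariate_normal(mu0, Sigma0)                     # prior draw, z_0
    zs, xs = [], []
    for _ in range(T):
        z = A @ z + rng.multivariate_normal(np.zeros(Dz), Q)     # transition + noise delta
        x = H @ z + rng.multivariate_normal(np.zeros(Dx), R)     # emission + noise epsilon
        zs.append(z)
        xs.append(x)
    return np.stack(zs), np.stack(xs)
```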

Proofs

Important Equations

Assumption: the stochastic process involves a Markov chain (i.e. a latent state) that evolves with only local, finite memory.

Markov Chains formalize the notion of a stochastic process with a local finite memory.

Inference over Markov chains separates into three operations that can be performed in linear time, i.e. $\mathcal{O}(T)$.

Filtering

Filtering - Predict

This is the Chapman-Kolmogorov equation!

$$
p(\mathbf{z}_t|\mathbf{x}_{1:t-1}) = \int p(\mathbf{z}_t | \mathbf{z}_{t-1}) \; p(\mathbf{z}_{t-1}|\mathbf{x}_{1:t-1})\; d\mathbf{z}_{t-1}
$$

We predict using all of the past data points...


Let’s take the standard integral equation:

$$
p(\mathbf{z}_t|\mathbf{x}_{1:t-1}) = \int p(\mathbf{z}_t|\mathbf{z}_{t-1})\,p(\mathbf{z}_{t-1}|\mathbf{x}_{1:t-1})\,d\mathbf{z}_{t-1}
$$

We can substitute our Gaussian assumptions for each of the above equations.

Proof (Term I)

We can directly take this term from our assumption for equation 2.

$$
p(\mathbf{z}_{t+1}|\mathbf{z}_{t}) = \mathcal{N}(\mathbf{z}_{t+1}|\mathbf{Az}_t, \mathbf{Q})
$$

The only difference is a shift of the index $t$: we want $p(\mathbf{z}_{t}|\mathbf{z}_{t-1})$ instead of $p(\mathbf{z}_{t+1}|\mathbf{z}_t)$.

$$
p(\mathbf{z}_{t}|\mathbf{z}_{t-1}) = \mathcal{N}(\mathbf{z}_{t}|\mathbf{Az}_{t-1}, \mathbf{Q})
$$

QED.

Proof (Term II)

This term is actually much simpler than it seems. It is the filtering distribution from the previous time step, which (by induction) we assume is Gaussian, $p(\mathbf{z}_{t-1}|\mathbf{x}_{1:t-1}) = \mathcal{N}(\mathbf{z}_{t-1}|\boldsymbol{\mu}_{t-1}, \boldsymbol{\Sigma}_{t-1})$. Substituting both terms into the integral gives:

$$
p(\mathbf{z}_t|\mathbf{x}_{1:t-1}) = \int \mathcal{N}(\mathbf{z}_{t}|\mathbf{Az}_{t-1}, \mathbf{Q})\;p(\mathbf{z}_{t-1}|\mathbf{x}_{1:t-1}) \; d\mathbf{z}_{t-1}
$$
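Carrying out this Gaussian marginalization in closed form (using the standard identity for integrating a linear-Gaussian conditional against a Gaussian, with the previous filtering distribution written as $\mathcal{N}(\mathbf{z}_{t-1}|\boldsymbol{\mu}_{t-1}, \boldsymbol{\Sigma}_{t-1})$) recovers exactly the predict-step equations stated earlier:

$$
\int \mathcal{N}(\mathbf{z}_{t}|\mathbf{Az}_{t-1}, \mathbf{Q})\,\mathcal{N}(\mathbf{z}_{t-1}|\boldsymbol{\mu}_{t-1}, \boldsymbol{\Sigma}_{t-1})\,d\mathbf{z}_{t-1} = \mathcal{N}\left(\mathbf{z}_{t}\,\middle|\,\mathbf{A}\boldsymbol{\mu}_{t-1},\ \mathbf{A}\boldsymbol{\Sigma}_{t-1}\mathbf{A}^\top + \mathbf{Q}\right)
$$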

Filtering - Update

$$
p(\mathbf{z}_t|\mathbf{x}_{1:t}) = \frac{p(\mathbf{x}_t|\mathbf{z}_t)\,p(\mathbf{z}_t|\mathbf{x}_{1:t-1})}{p(\mathbf{x}_t|\mathbf{x}_{1:t-1})}
$$
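Substituting the Gaussian emission model and the Gaussian predictive distribution into this Bayes update and completing the square yields another Gaussian, whose mean and covariance are exactly the Kalman update equations from above:

$$
p(\mathbf{z}_t|\mathbf{x}_{1:t}) = \mathcal{N}\left(\mathbf{z}_t\,\middle|\,\boldsymbol{\mu}_{t|t-1} + \mathbf{K}_t(\mathbf{x}_t - \mathbf{H}\boldsymbol{\mu}_{t|t-1}),\ (\mathbf{I} - \mathbf{K}_t\mathbf{H})\boldsymbol{\Sigma}_{t|t-1}\right)
$$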

Smoothing

$$
p(\mathbf{z}_t|\mathbf{x}_{1:T}) = p(\mathbf{z}_t|\mathbf{x}_{1:t}) \int p(\mathbf{z}_{t+1}|\mathbf{z}_t)\; \frac{p(\mathbf{z}_{t+1}|\mathbf{x}_{1:T})}{p(\mathbf{z}_{t+1}|\mathbf{x}_{1:t})} \; d\mathbf{z}_{t+1}
$$

Take-Home: we can use almost any structure we want for the probability density functions, e.g. $p(\mathbf{z}|\mathbf{x})$; inference in this model will have linear cost (instead of the cubic cost of Gaussian processes).

Note: the integrals are the tricky part...

Likelihood

The marginal likelihood of the measurements decomposes into the one-step-ahead predictive densities computed during filtering:

$$
p(\mathbf{x}_{1:T}) = \prod_{t=1}^{T} p(\mathbf{x}_t|\mathbf{x}_{1:t-1})
$$

Gauss-Markov Models

Assumptions:

Predictive Distribution: is a Gaussian distribution

$$
p(\mathbf{z}_{t+1}|\mathbf{z}_{1:t}) = \mathcal{N}(\mathbf{z}_{t+1}; \mathbf{Az}_t, \mathbf{Q})
$$

This provides a linear relationship between the previous state, $\mathbf{z}_t$, and the subsequent state, $\mathbf{z}_{t+1}$, with some additive Gaussian noise.

We have an initial belief about the initial state.

$$
p(\mathbf{z}_0)= \mathcal{N}(\mathbf{z}_0; \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)
$$

Observation Likelihood:

$$
p(\mathbf{x}_t|\mathbf{z}_t)=\mathcal{N}(\mathbf{x}_t; \mathbf{Hz}_t,\mathbf{R})
$$

Easy Solution: Assume Gaussianity


One easy simplification is to assume linear dynamics.

By assuming linear transition and emission dynamics, we can use Gaussian distributions throughout. This results in filtering and smoothing posteriors that are exact and easy to calculate via the Kalman filter and RTS smoother algorithms.


Transition Dynamics

We can assume that the transition dynamics are a linear transformation with an additive noise term.

$$
\mathbf{z}_t = \boldsymbol{f}_{\boldsymbol \theta} (\mathbf{z}_{t-1}) := \mathbf{F}_{t} \mathbf{z}_{t-1} + \boldsymbol{\eta}_t
$$
  • $\mathbf{F}_{t} \in \mathbb{R}^{N_z \times N_z}$ is the transition matrix. This encodes the physics of the process.
  • $\boldsymbol{\eta}_t \sim \mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_t)$ is the additive noise model.

Because we’ve assumed linear dynamics, we can easily incorporate Gaussian assumptions on the mean and noise distribution, i.e.

$$
p(\mathbf{z}_{t}|\mathbf{z}_{t-1}) = \mathcal{N}(\mathbf{z}_t| \mathbf{F}_{t} \mathbf{z}_{t-1}, \boldsymbol{\Sigma}_t)
$$

We can of course solve for these terms exactly using a defined set of equations because we can propagate linear transformations through Gaussian distributions in closed form.

Note: we can assume the initial state is Gaussian distributed as well, with or without transition dynamics, $\mathbf{z}_0 \sim \mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$.


Observation Model

Again, we assume the observation model can be defined by a linear operator,

$$
\mathbf{x}_t = \mathbf{A}_t^\top \mathbf{z}_t + \boldsymbol{\epsilon}_t
$$
  • $\mathbf{A}_{t} \in \mathbb{R}^{N_z \times N_x}$ is the emission matrix
  • $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0},\boldsymbol{\Gamma}_t)$ is the additive noise.

So, again, we assume a Gaussian likelihood, and this distribution is straightforward:

$$
p(\mathbf{x}_t|\mathbf{z}_t) = \mathcal{N}(\mathbf{x}_t| \mathbf{A}_t^\top \mathbf{z}_t, \boldsymbol{\Gamma}_t)
$$

because we can propagate linear transformations through Gaussian distributions in closed form.


Note about Dimensionality

The observations, $\mathbf{x}_t \in \mathbb{R}^{N_x}$, will typically have fewer features than the latent space, $\mathbf{z} \in \mathbb{R}^{N_z}$.

$$
N_x \ll N_z
$$

This is not always true, but here we assume that the dynamics of the state space live in a higher dimension than the observation space.


Learnable Parameters

So for this model, we have the following parameters, $\boldsymbol{\theta}$, to find (a minimal container sketch follows the list):

  • Initial State - $\{ \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0 \}$
  • Transition Matrices + Noise - $\{ \mathbf{F}_t, \boldsymbol{\Sigma}_t \}_{t \geq 1}$
  • Emission Matrices + Noise - $\{ \mathbf{A}_t, \boldsymbol{\Gamma}_t \}_{t \geq 1}$
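As a concrete sketch, these parameters can be collected in a simple container; the names and the time-invariant simplification here are assumptions for illustration, not from any particular library.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LGSSMParams:
    """Learnable parameters of a (time-invariant) linear Gaussian state space model."""
    mu0: np.ndarray     # initial state mean,        shape (Nz,)
    Sigma0: np.ndarray  # initial state covariance,  shape (Nz, Nz)
    F: np.ndarray       # transition matrix,         shape (Nz, Nz)
    Sigma: np.ndarray   # transition noise cov,      shape (Nz, Nz)
    A: np.ndarray       # emission matrix,           shape (Nz, Nx); applied as A.T @ z
    Gamma: np.ndarray   # emission noise cov,        shape (Nx, Nx)
```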

Summary Equations

$$
\begin{aligned}
p(\mathbf{z}_0) &= \mathcal{N}(\mathbf{z}_0 | \boldsymbol{\mu}_0, \mathbf{P}_0) \\
p(\mathbf{z}_{t}|\mathbf{z}_{t-1}) &= \mathcal{N}(\mathbf{z}_{t}|\mathbf{F}\mathbf{z}_{t-1}, \mathbf{Q}) \\
p(\mathbf{x}_t|\mathbf{z}_t) &= \mathcal{N}(\mathbf{x}_t|\mathbf{H}\mathbf{z}_t, \mathbf{R})
\end{aligned}
$$



Pros and Cons

It is important to reflect on the assumptions we made in this model. This gives us some intuition about what we can expect and whether there are underlying problems we can diagnose more easily.

Pros

Simple

We have assumed linear functions and Gaussian distributions. These are easily interpretable because they are models we can understand and characterize completely.

Efficient

As mentioned above, all of the quantities can be calculated in linear time in $T$; each step involves only small matrix operations.

Closed-Form / Exact

Inference is straightforward for these models because the joint distribution of the observation and the latent variable can be factorized:

$$
p(\mathbf{x}_{1:T}, \mathbf{z}_{0:T}) = p(\mathbf{x}_{1:T}|\mathbf{z}_{0:T})\,p(\mathbf{z}_{0:T})
$$

Furthermore, with the Markovian principle of dependency only on the previous state, we can factorize this even more.

$$
p(\mathbf{x}_{1:T}|\mathbf{z}_{0:T})\,p(\mathbf{z}_{0:T}) = p(\mathbf{z}_0)\prod_{t=1}^T p(\mathbf{x}_{t}|\mathbf{z}_{t}) \prod_{t=1}^{T} p(\mathbf{z}_{t}|\mathbf{z}_{t-1})
$$

All of the terms within this equation are known and have exact equations (the Kalman filter equations).
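As a sanity check on this factorization, the log of the joint density of a simulated trajectory can be accumulated term by term; a sketch using SciPy, assuming time-invariant $\mathbf{A}$, $\mathbf{Q}$, $\mathbf{H}$, $\mathbf{R}$, with `zs` holding $\mathbf{z}_{0:T}$ and `xs` holding $\mathbf{x}_{1:T}$ (names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def log_joint(zs, xs, mu0, Sigma0, A, Q, H, R):
    """log p(z_{0:T}, x_{1:T}) = log prior + sum of transition and emission log-densities."""
    lp = mvn.logpdf(zs[0], mean=mu0, cov=Sigma0)             # prior term, p(z_0)
    for t in range(1, len(zs)):
        lp += mvn.logpdf(zs[t], mean=A @ zs[t - 1], cov=Q)   # transition term, p(z_t | z_{t-1})
        lp += mvn.logpdf(xs[t - 1], mean=H @ zs[t], cov=R)   # emission term, p(x_t | z_t)
    return lp
```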

Cons

Linear Assumptions

We make linear assumptions for both the transition model and the observation model. This can fail spectacularly in non-linear cases.

High-Dimensionality

The input observation data, $\mathbf{x}$, can be very high dimensional and can have different characteristics. For example, we can have the following data types:

  • Tabular Data: Features = 50
  • Time Series: Time x Features = 100 x 50 = 5_000
  • Images: Channels x Height x Width = 3 x 32 x 32 = 3_072
  • Video: Time x Channels x Height x Width = 100 x 3 x 32 x 32 = 307_200

Fortunately, given the Markovian assumption, we can factorize out the temporal dimension. However, we are still left with a high-dimensional state at each time step for which we must calculate correlations.

  • Tabular Data: Features = 50
  • Images: Channels x Height x Width = 3 x 32 x 32 = 3_072

In addition, we have a large parameter space because we don’t utilize any architectures to capture any non-linear pairwise correlations between the data points in space (or time). For example, a convolutional operator would be able to capture local correlations wrt the channels in space.


Non-Linearity

As mentioned above, one downside is the assumption that the observation model can be captured by a linear projection of the state, $\mathbf{z}_t$. It is well known that real dynamics are often non-linear. One remedy is to augment the observation model with normalizing flows.

$$
\begin{aligned}
\mathbf{x}_t &= \mathbf{A}_t^\top \mathbf{z}_t + \boldsymbol{\epsilon}_t \\
\mathbf{x}_t &= \boldsymbol{g}_{\boldsymbol \theta}(\mathbf{y}_t)
\end{aligned}
$$
  • $\mathbf{A}_{t} \in \mathbb{R}^{N_z \times N_x}$ is the emission matrix
  • $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0},\boldsymbol{\Gamma}_t)$ is the additive noise model.
  • $\boldsymbol{g}_{\boldsymbol \theta}: \mathcal{Y}\in\mathbb{R}^{N_y} \rightarrow \mathcal{X}\in\mathbb{R}^{N_x}$ is an invertible, diffeomorphic transformation

Assumption: by augmenting the observations $\mathbf{y}\in \mathbb{R}^{N_y}$ with an invertible transformation, $\boldsymbol{g}_{\boldsymbol \theta}$, we can obtain a representation s.t. the observation model is a linear transformation.

The only term that this affects is the likelihood term, $p(\mathbf{x}_t|\mathbf{z}_t)$. By definition, $\boldsymbol{g}_{\boldsymbol \theta}$ is an invertible transformation, so we can calculate the likelihood term exactly via the change-of-variables formula.

$$
p(\mathbf{y}_t|\mathbf{z}_t) = p_\mathcal{X}(\mathbf{x}_t|\mathbf{z}_t)\left| \det \boldsymbol{\nabla}_{\mathbf{y}_t} \boldsymbol{g}_{\boldsymbol \theta}(\mathbf{y}_t) \right|
$$

So we can continue to take advantage of the closed-form solutions, only with the addition of a non-linear transformation.
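A minimal sketch of this correction follows; the elementwise softplus map and all names are illustrative assumptions (chosen only because its Jacobian is diagonal and cheap), standing in for a learned flow. Here $\boldsymbol{g}$ maps an observation $\mathbf{y}$ into the space where the linear-Gaussian emission model holds, and the log-determinant term corrects the density.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def g(y):
    """Toy invertible, elementwise map y -> x (softplus), standing in for a learned flow."""
    return np.log1p(np.exp(y))

def log_det_jacobian(y):
    """log |det dg/dy| for the elementwise map above (diagonal Jacobian; d softplus = sigmoid)."""
    return np.sum(np.log(1.0 / (1.0 + np.exp(-y))))

def log_likelihood_flow(y, z, A, Gamma):
    """log p(y | z) = log p_X(g(y) | z) + log |det dg/dy|, per the change of variables above."""
    x = g(y)
    return mvn.logpdf(x, mean=A.T @ z, cov=Gamma) + log_det_jacobian(y)
```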

Note: the computational bottleneck of this term is evaluating the Jacobian determinant of $\boldsymbol{g}_{\boldsymbol \theta}$.


Dimensionality Reduction

Another aspect mentioned above is the high-dimensional data, $\mathbf{x}$. The input could be a very long feature vector or perhaps an image. A linear transformation would have trouble capturing and generalizing over such a complex input space.

We can use a Variational AutoEncoder (VAE) to embed the input $\mathbf{y} \in \mathbb{R}^{N_y}$ into a lower-dimensional representation, $\mathbf{x} \in \mathbb{R}^{N_x}$. This makes use of an encoder-decoder structure.