Skip to article frontmatterSkip to article content

State Space Models

CNRS
MEOM

Markov Models

In these notes, we walk through a model for modeling time-dependent data. By enforcing the Markov chain properties, we only have a variable at time, tt, depend on the variable at a previous time step, t1t-1. This results in very efficient directed graph which leads to inference of order O(T)\mathcal{O}(T).

The main source of inspiration for this is the lecture from the Probabilistic ML course from Tubingen Henning, 2020. Some of the details we taken from the probabilistic machine learning textbook from Kevin Murphy Murphy, 2012.


Motivation

Consider a large dimensional dataset, e.g. a data cube. This will be of size:

yRD\boldsymbol{y} \in \mathbf{R}^{D}

But let’s assume that it is a spatio-temporal dataset. Then we can decompose the dimension, DD into the following components.

Dimensions=[Time×Space×Variables]\text{Dimensions} = [ \text{Time} \times \text{Space} \times \text{Variables}]

So we can rewrite this with this decomposition

yRDt×Dy×DΩ\boldsymbol{y} \in \mathbf{R}^{D_t \times D_y \times D_\Omega}

This poses some immediate problems when we consider the full decomposition.

High Dimensionality. This is a very high dimensional dataset. For example, if we have a very long time series like 1,0001,000 time steps, then we will have a massive DD-dimensional vector for the input variable.

Time Dependencies. These time dependencies are very difficult to model. They are highly correlated, especially at very near, e.g. t1t-1, tt, and t1t-1.

Let’s say we are given a sequence of measurements, yn\boldsymbol{y}_n.

D={yn}n=1Nt,ynRDy\begin{aligned} \mathcal{D} &= \left\{ \boldsymbol{y}_n \right\}_{n=1}^{N_t}, && && \boldsymbol{y}_n\in\mathbb{R}^{D_y} \end{aligned}

How do we find an appropriate generative model for this sequence of measurements?


Schematic

This method seeks to decouple time by enforcing the Markov assumption.

markov_model

A graphical model for the dependencies between the variables x and z. Notice how z only depends on the previous time step.

This gives us the classical state-space formulation

Initial Distribution:z0p(z0;θ)Markovian Dynamics Model:ztp(ztzt1;θ)Measurement Model:ytp(ytzt;θ)\begin{aligned} \text{Initial Distribution}: && && \boldsymbol{z}_0 &\sim p(\boldsymbol{z}_0;\boldsymbol{\theta}) \\ \text{Markovian Dynamics Model}: && && \boldsymbol{z}_t &\sim p(\boldsymbol{z}_t|\boldsymbol{z}_{t-1};\boldsymbol{\theta}) \\ \text{Measurement Model}: && && \boldsymbol{y}_t &\sim p(\boldsymbol{y}_t|\boldsymbol{z}_{t};\boldsymbol{\theta}) \\ \end{aligned}

Here, we have an initial distribution as a prior over the first state, z0\boldsymbol{z}_0. We have a dynamics model as the prior over the forward state transitions. We have a measurement model as a generative model for the observations of the latent state.

The key is that by enforcing these Markovian assumptions, we have a directed graph structure that results in very efficient inference. This is all due to the Markov property due to the chain structure.


Markov Properties


Property of States

Given the immediate past, the present state is independent of the entire history before it.

Given zt1\boldsymbol{z}_{t-1}, zt\boldsymbol{z}_t is independent of any other of the previous states, e.g. zt2,zt3,\boldsymbol{z}_{t-2}, \boldsymbol{z}_{t-3}, \ldots.

p(ztz1:t1,y1:t1)=p(ztzt1)p(\boldsymbol{z}_t | \boldsymbol{z}_{1:t-1}, \boldsymbol{y}_{1:t-1}) = p(\boldsymbol{z}_t|\boldsymbol{z}_{t-1})

This is enforcing some kind of local memory within our system. So even if we have the full system of observed variables, y1:T\boldsymbol{y}_{1:T}, and the posterior states, z1:T\boldsymbol{z}_{1:T}, we still only have the dependencies on the previous time step:

p(zt1z1:T,y1:T)=p(zt1zt)p(\boldsymbol{z}_{t-1}|\boldsymbol{z}_{1:T}, \boldsymbol{y}_{1:T}) = p(\boldsymbol{z}_{t-1}|\boldsymbol{z}_t)

Bottom line - the past is independent of the future given the present.


Conditional Independence of Measurements

The current measurement is conditionally independent of the past measurements and states given the current state.

We assume that the measurement, yt\boldsymbol{y}_t, given the current state, zt\boldsymbol{z}_t, is conditionally independent of the measurements and its histories.

p(ytz1:t,y1:t1)=p(ytzt)p(\boldsymbol{y}_t|\boldsymbol{z}_{1:t}, \boldsymbol{y}_{1:t-1}) = p(\boldsymbol{y}_t|\boldsymbol{z}_t)

So as you can see, the measurement at time, tt, is only dependent on the state, z\boldsymbol{z}, at time tt state irregardless of how many other time steps have been observed.


Joint Distribution

Given the above properties, this represents how we decompose the time series. We use the properties mentioned above.

p(z0:T,y1:T;θ)=p(z0;θ)t=1Tp(ytzt;θ)p(ztzt1;θ)p(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T};\boldsymbol{\theta}) = p\left(\boldsymbol{z}_0;\boldsymbol{\theta} \right) \prod_{t=1}^T p\left(\boldsymbol{y}_t|\boldsymbol{z}_t;\boldsymbol{\theta}\right) p\left(\boldsymbol{z}_t|\boldsymbol{z}_{t-1};\boldsymbol{\theta}\right)

While this may not be immediately useful, it is useful for certain other quantities of interest.


Posterior

We are interested in finding the latent states, z0:T\boldsymbol{z}_{0:T}, given our observations, y1:T\boldsymbol{y}_{1:T}. However, due to the Markovian nature of the state space model, this process is a combination of
This is known as filtering.

p(zty1:t)=p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t}) =


Quantities of Interest

Once we have the model structure, now we are interested in the specific quantities. All of them really boil down to quantities from inference.

TLDR

Posterior, p(zty1:t)p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t}) - this is the probability of the state, zt\boldsymbol{z}_t given the current and previous measurements, y1:t\boldsymbol{y}_{1:t}.

Predict Step, p(zty1:t1)=p(ztzt1)p(z)p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1})=\int p(\boldsymbol{z}_t|\boldsymbol{z}_{t-1})p(\boldsymbol{z}) - the current state, zt\boldsymbol{z}_t, given the past measurements, y1:t1\boldsymbol{y}_{1:t-1}.

Measurement Step, p(ztyt,y1:t1)p(ytzt)p(zty1:t1)p(\boldsymbol{z}_t|\boldsymbol{y}_t, \boldsymbol{y}_{1:t-1}) \propto p(\boldsymbol{y}_t|\boldsymbol{z}_t)p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1}) - the current state, zt\boldsymbol{z}_t, given the present measurement yt\boldsymbol{y}_t and past measurements, y1:t1\boldsymbol{y}_{1:t-1}

Marginal Likelihood - p(y1:T)=tTp(yty1:t1)p(\boldsymbol{y}_{1:T}) = \sum_{t}^T p(\boldsymbol{y}_t|\boldsymbol{y}_{1:t-1}) - the likelihood of measurements, y1:T\boldsymbol{y}_{1:T}, given the state, z1:T\boldsymbol{z}_{1:T}.

Posterior Predictive: p(yty1:t1)=p(ytzt)p(ztzt1)dztp(\boldsymbol{y}_t|\boldsymbol{y}_{1:t-1}) = \int p(\boldsymbol{y}_t|\boldsymbol{z}_t)p(\boldsymbol{z}_t|\boldsymbol{z}_{t-1})d\boldsymbol{z}_t - The probability of the measurement, yt\boldsymbol{y}_t, given the previous measurements, y1:t1\boldsymbol{y}_{1:t-1}.

Sampling (Posterior): z1:Tp(zty1:T)\boldsymbol{z}_{1:T} \sim p(\boldsymbol{z}_t|\boldsymbol{y}_{1:T}) - Trajectories for states, z1:t\boldsymbol{z}_{1:t}, given the measurements, y1:T\boldsymbol{y}_{1:T}.

Sampling (Measurements): ytp(yty1:t1)\boldsymbol{y}_t \sim p(\boldsymbol{y}_t|\boldsymbol{y}_{1:t-1}) - Trajectories for observations, y1:T\boldsymbol{y}_{1:T}, given the state space model, z1:T\boldsymbol{z}_{1:T}.


Filtering/Posterior

The probability of the state, zt\boldsymbol{z}_t given the current and previous measurements, y1:t\boldsymbol{y}_{1:t}.

We are interested in computing the belief of our state, zt\boldsymbol{z}_t. This is given by

p(zty1:t)=1E(θ)p(ztyt)p(zty1:t)p(\boldsymbol{z}_t | \boldsymbol{y}_{1:t}) = \frac{1}{\boldsymbol{E}(\boldsymbol{\theta})} p(\boldsymbol{z}_t|\boldsymbol{y}_t)p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t})

This equation is the posterior probability of zt\boldsymbol{z}_t given the present measurement, yt\boldsymbol{y}_t, and all of the past measurements, y1:t1\boldsymbol{y}_{1:t-1}. We can compute this using the Bayes method (eq bayes) in a sequential way.

This is given by the predict-update equations:

Prediction:p(zty1:t1;θ)=p(ztzt1;θ)p(zt1y1:t1;θ)dzt1Correction:p(zty1t;θ)=1E(θ)p(ytzt;θ)p(zty1:t1)\begin{aligned} \text{Prediction}: && && p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1};\boldsymbol{\theta}) &= \int p(\boldsymbol{z}_t|\boldsymbol{z}_{t-1};\boldsymbol{\theta}) p(\boldsymbol{z}_{t-1}|\boldsymbol{y}_{1:t-1};\boldsymbol{\theta})d\boldsymbol{z}_{t-1} \\ \text{Correction}: && && p(\boldsymbol{z}_t|\boldsymbol{y}_{1-t};\boldsymbol{\theta}) &= \frac{1}{\boldsymbol{E}(\boldsymbol{\theta})} p(\boldsymbol{y}_t|\boldsymbol{z}_t;\boldsymbol{\theta}) p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1}) \end{aligned}

It is a recursive relationship where the predict step depends on the correction step and vice versa.


Predict

The current state, zt\boldsymbol{z}_t, given the past measurements, y1:t1\boldsymbol{y}_{1:t-1}.

This quantity is given via the Chapman-Kolmogrov equation.

p(zty1:t1)=p(ztzt1)p(zt1y1:t1)dyt1p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1}) = \int p(\boldsymbol{z}_t|\boldsymbol{z}_{t-1})p(\boldsymbol{z}_{t-1}|\boldsymbol{y}_{1:t-1})d\boldsymbol{y}_{t-1}

Term I: the transition distribution between time steps.

Term II: the posterior distribution of the state, zt1\boldsymbol{z}_{t-1}, given all of the observations, y1:t1\boldsymbol{y}_{1:t-1}.

Note: term III is the posterior distribution but at a previous time step.


Correction

The posterior distribution of state, zt\boldsymbol{z}_t, given the current and previous measurements, y1:t\boldsymbol{y}_{1:t}.

p(zty1:t)=p(ytzt)p(zty1:t1)p(yt)p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t}) = \frac{p(\boldsymbol{y}_t|\boldsymbol{z}_t)p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1})}{p(\boldsymbol{y}_t)}

Term I: The observation model for the current measurement, yt\boldsymbol{y}_t, given the current state, zt\boldsymbol{z}_t.

Term II: The posterior distribution of the current state, zt\boldsymbol{z}_t, given all of the previous measurements, y1:t1\boldsymbol{y}_{1:t-1}.

Term III: The marginal distribution for the current measurement, yt\boldsymbol{y}_t.


Filtering Algorithm

The full form for filtering equation is given by an iterative process between the predict step and the update step.

1. Predict the next hidden state

  • First you get the posterior of the previous state, zt1\boldsymbol{z}_{t-1}, given all of the observations, y1:t1\boldsymbol{y}_{1:t-1}.
  • Second, you get the posterior of the current state, zt\boldsymbol{z}_t, given all of the observations, p(y1:t1)p(\boldsymbol{y}_{1:t-1})
p(zt1y1:t1)p(zty1:t1)p(\boldsymbol{z}_{t-1}|\boldsymbol{y}_{1:t-1}) \rightarrow p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1})

2. Predict the observation

  • First, you take the state, yt\boldsymbol{y}_t, given the previous measurements, yt\boldsymbol{y}_t.
  • Second you predict the current measurement, yt\boldsymbol{y}_t, given all previous measurements, y1:t1\boldsymbol{y}_{1:t-1}.
p(yty1:t1)p(yty1:t1)p(\boldsymbol{y}_t|\boldsymbol{y}_{1:t-1}) \rightarrow p(\boldsymbol{y}_t|\boldsymbol{y}_{1:t-1})

3. Update the hidden state given the observation

  • First, you take the new observation, yt\boldsymbol{y}_t
  • Then, you do an update step to get the current state, yt\boldsymbol{y}_t, given all previous measurements, y1:t\boldsymbol{y}_{1:t}.

Smoothing

We compute the state, zt\boldsymbol{z}_t, given all of the measurements, y1:T\boldsymbol{y}_{1:T} where 1<t<T1 < t < T.

p(zty1:T)p(\boldsymbol{z}_t|\boldsymbol{y}_{1:T})

We condition on the past and the future to significantly reduce the uncertainty.

This use case is very common when we want to understand and learn from data. In a practical sense, many reanalysis datasets take this into account.

p(zty1:T)=p(zty1:t)p(zt+1zt)p(zt+1y1:T)p(zt+1y1:t)dzt+1p(\boldsymbol{z}_t|\boldsymbol{y}_{1:T}) = p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t}) \int p(\boldsymbol{z}_{t+1}|\boldsymbol{z}_t) \frac{p(\boldsymbol{z}_{t+1}|\boldsymbol{y}_{1:T})}{p(\boldsymbol{z}_{t+1}|\boldsymbol{y}_{1:t})}d\boldsymbol{z}_{t+1}

Term I: The current state, zt\boldsymbol{z}_t, given all of the past, current and future measurements, y1:T\boldsymbol{y}_{1:T} (smoothing step)

Term II: The current state, zt\boldsymbol{z}_t, given all of the present and previous measurements, y1:t\boldsymbol{y}_{1:t} (the predict step)

Term III: The “future” state, zt+1\boldsymbol{z}_{t+1}, given the previous state, zt\boldsymbol{z}_t (transition prob)

Term IV: The “future” state, zt+1\boldsymbol{z}_{t+1}, given all of the measurements, y1:T\boldsymbol{y}_{1:T}.

Term V: The “future” state, zt+1\boldsymbol{z}_{t+1}, given all of the current and past measurements, y1:T\boldsymbol{y}_{1:T}.


Predictions

We want to predict the future state, zT+τ\boldsymbol{z}_{T+\tau}, given the past measurements, y1:T\boldsymbol{y}_{1:T}.

p(zT+τy1:T)p(\boldsymbol{z}_{T+\tau}|\boldsymbol{y}_{1:T})

where τ>0\tau > 0. τ is the horizon of our forecasting, i.e. it is how far ahead of TT we are trying to predict. So we can expand this to write that we are interested in the future hidden states, zT+τ\boldsymbol{z}_{T+\tau}, given all of the past measurements, y1:T\boldsymbol{y}_{1:T}.

p(zT+τy1:T)=zT+τztp(zT+τzt)p(zty1:T)p(\boldsymbol{z}_{T+\tau}|\boldsymbol{y}_{1:T}) = \sum_{\boldsymbol{z}_{T+\tau}} \sum_{\boldsymbol{z}_t} p(\boldsymbol{z}_{T+\tau}|\boldsymbol{z}_t) p(\boldsymbol{z}_t|\boldsymbol{y}_{1:T})

We could also want to get predictions for what we observe

p(yT+τy1:t)=p(yT+τzT+τ)p(zT+τy1:T)p(\boldsymbol{y}_{T+\tau}|\boldsymbol{y}_{1:t}) = \sum p(\boldsymbol{y}_{T+\tau}|\boldsymbol{z}_{T+\tau})p(\boldsymbol{z}_{T+\tau}|\boldsymbol{y}_{1:T})

This is known as the posterior predictive density.

This is often the most common use case in applications, e.g. weather predictions and climate model projections. The nice thing is that we will have this as a by-product of our model.

Likelihood Estimation

E(θ)=p(yty1:t1;θ)=p(ytzt;θ)p(zty1:t1)dzt\boldsymbol{E}(\boldsymbol{\theta}) = p(\boldsymbol{y}_{t}|\boldsymbol{y}_{1:t-1};\boldsymbol{\theta}) = \int p(\boldsymbol{y}_t|\boldsymbol{z}_t;\boldsymbol{\theta}) p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1})d\boldsymbol{z}_t

For learning, we need to calculate the most probable state-space that matches the given observations. This assumes that we have access to all of the measurements, y1:T\boldsymbol{y}_{1:T}.

LNLL=argmaxz1:Tp(z1:Ty1:T)\mathcal{L}_{NLL} = \operatorname*{argmax}_{\boldsymbol{z}_{1:T}} p(\boldsymbol{z}_{1:T}|\boldsymbol{y}_{1:T})

Note: This is a non-probabilistic approach to maximizing the likelihood. However, this could be very useful for some applications. Smoothing would be better but we still need to find the best parameters.

Posterior Samples

We are interested in generating possible states and state trajectories. In this case, we want the likelihood of a state trajectory, z1:T\boldsymbol{z}_{1:T}, given some measurements, y1:T\boldsymbol{y}_{1:T}. This is given by:

z1:Tp(z1:Ty1:T)\boldsymbol{z}_{1:T} \sim p(\boldsymbol{z}_{1:T}|\boldsymbol{y}_{1:T})

This is very informative because it can show us plausible interpretations of possible state spaces that could fit the measurements.

Marginal Likelihood

This is the probability of the evidence, i.e., the marginal probability of the measurements. This may be useful as an evaluation of the density of given measurements. We could write this as the joint probabilty

p(y1:T)=z1:Tp(z1:T,y1:T)p(\boldsymbol{y}_{1:T}) = \sum_{\boldsymbol{z}_{1:T}} p(\boldsymbol{z}_{1:T}, \boldsymbol{y}_{1:T})

We can decompose this using the conditional probability. This gives us

p(y1:T)=z1:Tp(y1:Tz1:T)p(z1:T)p(\boldsymbol{y}_{1:T}) = \sum_{\boldsymbol{z}_{1:T}} p(\boldsymbol{y}_{1:T}|\boldsymbol{z}_{1:T})p(\boldsymbol{z}_{1:T})

As shown by the above function, this is done by summing all of the hidden paths.

This can be useful if we want to use the learned model to classify sequences, perform clustering, or possibly anomaly detection.

Note: We can use the log version of this equation to deal with instabilities.

L=logp(y1:T)=z1:Tlogp(z1:T,y1:T)\mathcal{L} = \log p(\boldsymbol{y}_{1:T}) = \sum_{\boldsymbol{z}_{1:T}} \log p(\boldsymbol{z}_{1:T},\boldsymbol{y}_{1:T})

Complexity

This is the biggest reason why one would do a Markov assumptions aside from the simplicity. Let DD be the dimensionality of the state space, zz and TT be the number of time steps given by the measurements. We can give the computational complexity for each of the quantities listed above.

Filter-Predict

This is of order O(D2T)\mathcal{O}(D^2T)

If we assume sparsity in the methods, then we can reduce this to O(DT)\mathcal{O}(DT).

We can reduce the complexity even further by assuming some special matrices within the functions to give us a complexity of O(TDlogD)\mathcal{O}(T D \log D).

If we do a parallel computation, we can even have a really low computational complexity of O(DlogT)\mathcal{O}(D \log T).

Overall: the bottleneck of this method is not the computational speed, it’s the memory required to do all of the computations cheaply.


Viz

Bars
Regression

Cons

While we managed to reduce the dimensionality of our dataset, this might not be the optimal model to choose. We assume that zt\boldsymbol{z}_t only depends on the previous time step, ztt\boldsymbol{z}_{t-t}. But it could be the case that zt\boldsymbol{z}_t could depend on previous time steps, e.g. p(ztzt1,zt2,zt2,)p(\boldsymbol{z}_t | \boldsymbol{z}_{t-1}, \boldsymbol{z}_{t-2}, \boldsymbol{z}_{t-2}, \ldots). There is no reason to assume that

Multiscale Time Dependencies

One way to overcome this is to assume

Long Term

References
  1. Henning, P. (2020). Probabilistic ML - Lecture 12 - Gauss-Markov Models. https://youtu.be/FydcrDtZroU
  2. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. probml.ai