Notes on Deep Markov Models (DMM), also known as Deep Kalman Filters (DKF)
State Space Model¶

Figure 1: The interaction between the hidden state and the observations over time. The most important property is the Markov property, which dictates that the future state depends only on the current state and not on any earlier states. Source: Pyro Deep Markov Model tutorial.
We take the same state-space model as before; however, this time we do not restrict ourselves to linear functions, and instead allow non-linear transition and emission functions.
We allow a non-linear function, $\boldsymbol{f}_{\boldsymbol{\alpha}}$, for the transition model between states:

$$\mathbf{z}_t = \boldsymbol{f}_{\boldsymbol{\alpha}}(\mathbf{z}_{t-1}) + \boldsymbol{\epsilon}_t$$

We also allow a non-linear function, $\boldsymbol{h}_{\boldsymbol{\beta}}$, for the emission model to describe the relationship between the state and the measurements:

$$\mathbf{x}_t = \boldsymbol{h}_{\boldsymbol{\beta}}(\mathbf{z}_t) + \boldsymbol{\eta}_t$$

where $\boldsymbol{\epsilon}_t$ and $\boldsymbol{\eta}_t$ are noise terms.
We are still going to assume that the output is Gaussian distributed (in the case of regression; otherwise this could be, e.g., a Bernoulli distribution). We can write these distributions as follows:

$$p(\mathbf{z}_t \mid \mathbf{z}_{t-1}) = \mathcal{N}\left(\mathbf{z}_t \mid \boldsymbol{f}_{\boldsymbol{\alpha}}(\mathbf{z}_{t-1}), \boldsymbol{\Sigma}_{\mathbf{z}}\right)$$

$$p(\mathbf{x}_t \mid \mathbf{z}_t) = \mathcal{N}\left(\mathbf{x}_t \mid \boldsymbol{h}_{\boldsymbol{\beta}}(\mathbf{z}_t), \boldsymbol{\Sigma}_{\mathbf{x}}\right)$$

where $\boldsymbol{\alpha}$ is the parameterization for the transition model and $\boldsymbol{\beta}$ is the parameterization for the emission model. Notice how this assumes a non-linear transformation of the means of the Gaussian distributions; however, we still want the output to be Gaussian.
If we are given all of the observations, $\mathbf{x}_{1:T}$, we can write the joint distribution as:

$$p(\mathbf{x}_{1:T}, \mathbf{z}_{1:T}) = p(\mathbf{z}_1)\, p(\mathbf{x}_1 \mid \mathbf{z}_1) \prod_{t=2}^{T} p(\mathbf{x}_t \mid \mathbf{z}_t)\, p(\mathbf{z}_t \mid \mathbf{z}_{t-1})$$
If we wish to find the best function parameters based on the data, we can still calculate the marginal likelihood by integrating out the state, $\mathbf{z}_{1:T}$:

$$p(\mathbf{x}_{1:T}) = \int p(\mathbf{x}_{1:T}, \mathbf{z}_{1:T})\, d\mathbf{z}_{1:T}$$
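To make the generative process concrete, below is a minimal sketch of ancestral sampling from this joint distribution. The names transition_fn and emission_fn and the fixed noise scales are assumptions for illustration (in the full model the variances are also learned), not part of any particular library.

import torch

def sample_trajectory(transition_fn, emission_fn, z_init, num_steps, sigma_z=0.1, sigma_x=0.1):
    # Ancestral sampling from p(x_{1:T}, z_{1:T}):
    #   z_t ~ N(f_alpha(z_{t-1}), sigma_z^2 I),  x_t ~ N(h_beta(z_t), sigma_x^2 I)
    z_t, states, observations = z_init, [], []
    for _ in range(num_steps):
        # transition: propagate the latent state forward in time
        z_t = transition_fn(z_t) + sigma_z * torch.randn_like(z_t)
        # emission: map the latent state to an observation
        x_mean = emission_fn(z_t)
        x_t = x_mean + sigma_x * torch.randn_like(x_mean)
        states.append(z_t)
        observations.append(x_t)
    return torch.stack(states), torch.stack(observations)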
Inference¶
We can learn the parameters, $\{\boldsymbol{\alpha}, \boldsymbol{\beta}\}$, of the prescribed model by maximizing the marginal log-likelihood, i.e. the log transform of the marginal likelihood function above. For non-linear transition and emission functions, however, this integral over $\mathbf{z}_{1:T}$ is intractable, so in practice we maximize a variational lower bound (the ELBO) using an approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z}_{1:T} \mid \mathbf{x}_{1:T})$ produced by an inference (encoder) network.
Loss Function¶
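Following the standard DKF/DMM formulation, the training objective is the evidence lower bound (ELBO) on the marginal log-likelihood, and the loss is its negative:

$$\mathcal{L}(\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\phi}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}_{1:T} \mid \mathbf{x}_{1:T})}\left[ \log p_{\boldsymbol{\beta}}(\mathbf{x}_{1:T} \mid \mathbf{z}_{1:T}) \right] - \mathrm{KL}\left[ q_{\boldsymbol{\phi}}(\mathbf{z}_{1:T} \mid \mathbf{x}_{1:T}) \,\|\, p_{\boldsymbol{\alpha}}(\mathbf{z}_{1:T}) \right] \leq \log p(\mathbf{x}_{1:T})$$

so training minimizes $-\mathcal{L}$ over $\{\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\phi}\}$.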
Training¶
We can estimate the gradients of the ELBO with respect to $\{\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\phi}\}$ using stochastic gradient methods: we draw Monte Carlo samples from $q_{\boldsymbol{\phi}}$ with the reparameterization trick so that gradients can flow through the sampling step, and optimize with a standard gradient-based optimizer.
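A minimal sketch of one training step under these assumptions (single-sample Monte Carlo estimate of the ELBO, diagonal Gaussian posterior per time step, unit observation noise; the module and argument names are illustrative, not taken from a specific codebase):

import torch
from torch.distributions import Normal, kl_divergence

def elbo_step(x_seq, z_init, encoder, transition, emission, optimizer):
    # x_seq: (T, batch, obs_dim) observations; z_init: (batch, latent_dim) initial latent state
    # encoder(x_t, z_prev) -> (q_mu, q_logvar); transition(z_prev) -> (p_mu, p_logvar)
    # emission(z_t) -> mean of p(x_t | z_t)
    z_prev, elbo = z_init, 0.0
    for t in range(x_seq.shape[0]):
        p_mu, p_logvar = transition(z_prev)             # prior p(z_t | z_{t-1})
        q_mu, q_logvar = encoder(x_seq[t], z_prev)      # approximate posterior q(z_t | ...)
        q_dist = Normal(q_mu, (0.5 * q_logvar).exp())
        p_dist = Normal(p_mu, (0.5 * p_logvar).exp())
        z_t = q_dist.rsample()                          # reparameterized sample
        log_lik = Normal(emission(z_t), 1.0).log_prob(x_seq[t]).sum(-1)
        kl = kl_divergence(q_dist, p_dist).sum(-1)
        elbo = elbo + (log_lik - kl).mean()
        z_prev = z_t
    loss = -elbo                                        # minimize the negative ELBO
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()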
Literature¶
- 2D Convolutional Neural Markov Models for Spatiotemporal Sequence Forecasting - Halim & Kawamoto, 2020
- Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty - Liu et al., 2021
- Kalman Variational AutoEncoder - Fraccaro et al., 2017 | Code
- Normalizing Kalman Filter - Bézenac et al., 2020
- Dynamical VAEs - Girin et al., 2021 | Code
- Latent Linear Dynamics in Spatiotemporal Medical Data - Gunnarsson et al., 2021 | arXiv
Model Components¶
Transition Function¶
$$
\begin{aligned}
\mathbf{g}_t &= \operatorname{gate}(\mathbf{z}_{t-1}) \\
\tilde{\boldsymbol{\mu}}_t &= \operatorname{prop}(\mathbf{z}_{t-1}) \\
\boldsymbol{\mu}_t &= (1 - \mathbf{g}_t) \odot \left(\mathbf{W}_{\mu}\mathbf{z}_{t-1} + \mathbf{b}_{\mu}\right) + \mathbf{g}_t \odot \tilde{\boldsymbol{\mu}}_t \\
\log \boldsymbol{\sigma}_t^2 &= \mathbf{W}_{\sigma}\operatorname{ReLU}(\tilde{\boldsymbol{\mu}}_t) + \mathbf{b}_{\sigma}
\end{aligned}
$$

where:

- $\mathbf{g}_t \in [0, 1]$ is a gating vector that interpolates between a linear and a non-linear prediction of the mean,
- $\tilde{\boldsymbol{\mu}}_t$ is the non-linear proposed mean (the proposed_mean network below),
- $\mathbf{W}_{\mu}\mathbf{z}_{t-1} + \mathbf{b}_{\mu}$ is the linear mean path (initialized to the identity),
- $\boldsymbol{\mu}_t$ and $\log \boldsymbol{\sigma}_t^2$ parameterize the transition distribution $p(\mathbf{z}_t \mid \mathbf{z}_{t-1}) = \mathcal{N}\left(\boldsymbol{\mu}_t, \operatorname{diag}(\boldsymbol{\sigma}_t^2)\right)$.
Functions¶
Gate
# PyTorch-style building blocks; imports shared by the snippets below
from torch.nn import Linear, Module, ReLU, Sequential, Sigmoid

# gating network: maps z_{t-1} to a vector in [0, 1]
gate = Sequential(
    Linear(latent_dim, hidden_dim),
    ReLU(),
    Linear(hidden_dim, latent_dim),
    Sigmoid(),
)
Proposed Mean
proposed_mean = Sequential(
Linear(latent_dim, hidden_dim),
ReLU(),
Linear(hidden_dim, latent_dim)
)
Mean
z_to_mu = Linear(latent_dim, latent_dim)
LogVar
z_to_logvar = Linear(latent_dim, latent_dim)
Initialization
Here, we want the linear mean path, z_to_mu, to start out as the identity function. This helps training because we do not start with completely nonsensical predictions, which can lead to exploding gradients.
# identity weight and zero bias, so z_to_mu starts as the identity map
z_to_mu.weight = eye(latent_dim)
z_to_mu.bias = zeros(latent_dim)
Function
z_gate = gate(z_t_1)
z_prop_mean = proposed_mean(z_t_1)
# mean prediction
z_mu = (1 - z_gate) * z_to_mu(z_t_1) + z_gate * z_prop_mean
# log var predictions
z_logvar = z_to_logvar(nonlin_fn(z_prop_mean))
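Putting these pieces together, here is a minimal sketch of the full gated transition as a single PyTorch module (the class name and constructor signature are assumptions for illustration):

import torch
from torch import nn

class GatedTransition(nn.Module):
    # gated transition p(z_t | z_{t-1}) = N(mu, diag(exp(logvar)))
    def __init__(self, latent_dim, hidden_dim):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim), nn.Sigmoid(),
        )
        self.proposed_mean = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        self.z_to_mu = nn.Linear(latent_dim, latent_dim)
        self.z_to_logvar = nn.Linear(latent_dim, latent_dim)
        # initialize the linear mean path to the identity map
        with torch.no_grad():
            self.z_to_mu.weight.copy_(torch.eye(latent_dim))
            self.z_to_mu.bias.zero_()

    def forward(self, z_t_1):
        z_gate = self.gate(z_t_1)
        z_prop_mean = self.proposed_mean(z_t_1)
        # gated interpolation between the linear and the proposed mean
        z_mu = (1 - z_gate) * self.z_to_mu(z_t_1) + z_gate * z_prop_mean
        z_logvar = self.z_to_logvar(torch.relu(z_prop_mean))
        return z_mu, z_logvar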
Emission Function¶
$$\boldsymbol{\mu}_{\mathbf{x}_t} = \boldsymbol{h}_{\boldsymbol{\beta}}(\mathbf{z}_t)$$

where:

- $\boldsymbol{h}_{\boldsymbol{\beta}}$ is a multi-layer perceptron that maps the latent state to the observation space,
- $\boldsymbol{\mu}_{\mathbf{x}_t}$ is the mean of the emission distribution $p(\mathbf{x}_t \mid \mathbf{z}_t)$.
Tabular¶
Emission Network

class Emission(Module):
    def __init__(self, latent_dim, hidden_dim, input_dim) -> None:
        super().__init__()
        self.input_dim = input_dim
        # MLP mapping the latent state to the mean of p(x_t | z_t)
        self.z_to_mu = Sequential(
            Linear(latent_dim, hidden_dim),
            ReLU(),
            Linear(hidden_dim, hidden_dim),
            ReLU(),
            Linear(hidden_dim, input_dim),
        )

    def forward(self, z_t):
        # mean of the Gaussian emission distribution
        return self.z_to_mu(z_t)
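As a usage sketch, assuming the Emission module above (and the imports from the transition snippets), the emitted mean can parameterize a Gaussian observation model:

import torch
from torch.distributions import Normal

emitter = Emission(latent_dim=4, hidden_dim=32, input_dim=10)
z_t = torch.randn(8, 4)        # a batch of latent states
x_obs = torch.randn(8, 10)     # placeholder observations
x_mu = emitter(z_t)            # mean of p(x_t | z_t)
p_x = Normal(x_mu, 1.0)        # unit observation noise assumed for illustration
log_lik = p_x.log_prob(x_obs).sum(-1)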
Resources¶
Code¶
- DanieleGammelli/DeepKalmanFilter | Demo Notebook
- zshicode/Deep-Learning-Based-State-Estimation | Paper
- morimo27182/DeepKalmanFilter | Demo NB (sine wave)
- hmsandager/Normalizing-flow-and-deep-kalman-filter | Demo NB (Lat/Lon Data)
- ConvLSTM DKF | Data
- DMM from Scratch
- LGSSM from Scratch
- KF for Irregular TS | Discrete KF (sine wave) | Discretize vs TorchDiffEQ
- Optimized Kalman Filter | Demo NB
- Halim, C. J., & Kawamoto, K. (2020). 2D Convolutional Neural Markov Models for Spatiotemporal Sequence Forecasting. Sensors (Basel, Switzerland), 20.
- Liu, W., Lai, Z., Bacsa, K., & Chatzi, E. (2021). Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty.
- Fraccaro, M., Kamronn, S., Paquet, U., & Winther, O. (2017). A Disentangled Recognition and Nonlinear Dynamics Model for Unsupervised Learning.
- de Bézenac, E., Rangapuram, S. S., Benidis, K., Bohlke-Schneider, M., Kurle, R., Stella, L., Hasson, H., Gallinari, P., & Januschowski, T. (2020). Normalizing Kalman Filters for Multivariate Time Series Analysis. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in Neural Information Processing Systems (Vol. 33, pp. 2995–3007). Curran Associates, Inc. https://proceedings.neurips.cc/paper/2020/file/1f47cef5e38c952f94c5d61726027439-Paper.pdf
- Girin, L., Leglaive, S., Bie, X., Diard, J., Hueber, T., & Alameda-Pineda, X. (2021). Dynamical Variational Autoencoders: A Comprehensive Review. Foundations and Trends® in Machine Learning, 15(1–2), 1–175. 10.1561/2200000089