$$
\begin{aligned}
\text{Initial Condition}: && &&
\boldsymbol{z}_0 &\sim \mathcal{N}(\boldsymbol{z}_0|\boldsymbol{\mu}_0,\boldsymbol{\Sigma}_0) \\
\text{Dynamical Model}: && &&
\boldsymbol{z}_t &\sim \mathcal{N}(\boldsymbol{z}_t|\boldsymbol{f}(\boldsymbol{z}_{t-1};\boldsymbol{\theta}),\boldsymbol{\Sigma_z}) \\
\text{Measurement Model}: && &&
\boldsymbol{y}_t &\sim \mathcal{N}(\boldsymbol{y}_t|\boldsymbol{h}(\boldsymbol{z}_{t};\boldsymbol{\theta}),\boldsymbol{\Sigma_y})
\end{aligned}
$$
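To make the generative structure concrete, here is a minimal sketch of ancestral sampling from this model, assuming a linear transition $\boldsymbol{f}(\boldsymbol{z}) = \mathbf{F}\boldsymbol{z}$ and a linear measurement $\boldsymbol{h}(\boldsymbol{z}) = \mathbf{H}\boldsymbol{z}$; the function and variable names (`simulate_ssm`, `F`, `H`, ...) are illustrative only.

```python
# Minimal sketch: ancestral sampling from a linear-Gaussian SSM (illustrative names).
import numpy as np

def simulate_ssm(F, H, Sigma_z, Sigma_y, mu_0, Sigma_0, T, rng):
    z = rng.multivariate_normal(mu_0, Sigma_0)       # z_0 ~ N(mu_0, Sigma_0)
    zs, ys = [], []
    for _ in range(T):
        z = rng.multivariate_normal(F @ z, Sigma_z)  # z_t ~ N(f(z_{t-1}), Sigma_z)
        y = rng.multivariate_normal(H @ z, Sigma_y)  # y_t ~ N(h(z_t), Sigma_y)
        zs.append(z)
        ys.append(y)
    return np.stack(zs), np.stack(ys)

rng = np.random.default_rng(0)
F, H = np.array([[0.9]]), np.array([[1.0]])
zs, ys = simulate_ssm(F, H, 0.1 * np.eye(1), 0.5 * np.eye(1),
                      np.zeros(1), np.eye(1), T=100, rng=rng)
```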
**Assumptions**:

- The transition function, $\boldsymbol{f}$, and the measurement function, $\boldsymbol{h}$, are known.
- Gaussian system and measurement noise.
- Gaussian distributions everywhere.

## Core Operations

- Posterior Filtering - Prediction + Correction
- Marginal Likelihood
- Posterior Samples

## Bayesian

### Joint Distribution

This represents how we decompose the time series.
We use the Markov property, which states that, given the current state, the state at time step $t+1$ is conditionally independent of all earlier time steps, $t-\tau$.
$$
p(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T}) =
\mathcal{N}\left(\boldsymbol{z}_0|\boldsymbol{\mu}_0,\boldsymbol{\Sigma}_0\right)
\prod_{t=1}^T
\mathcal{N}\left(\boldsymbol{y}_t|\boldsymbol{h}(\boldsymbol{z}_t;\boldsymbol{\theta}),\boldsymbol{\Sigma_y}\right)
\mathcal{N}\left(\boldsymbol{z}_t|\boldsymbol{f}(\boldsymbol{z}_{t-1};\boldsymbol{\theta}),\boldsymbol{\Sigma_z}\right)
$$

We see that this factorizes over time into the initial condition, the transition terms, and the emission terms.
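As a sanity check on this factorization, one can evaluate the joint log-density term by term. This is a sketch assuming the linear-Gaussian model from above; `zs` holds $\boldsymbol{z}_{0:T}$, `ys` holds $\boldsymbol{y}_{1:T}$, and all names are illustrative.

```python
# Sketch: evaluating the factorized joint log-density log p(z_{0:T}, y_{1:T})
# for the linear-Gaussian case (illustrative names).
from scipy.stats import multivariate_normal as mvn

def joint_log_prob(zs, ys, F, H, Sigma_z, Sigma_y, mu_0, Sigma_0):
    lp = mvn.logpdf(zs[0], mu_0, Sigma_0)                 # initial condition N(z_0 | mu_0, Sigma_0)
    for t in range(1, len(zs)):
        lp += mvn.logpdf(zs[t], F @ zs[t - 1], Sigma_z)   # transition term N(z_t | F z_{t-1}, Sigma_z)
        lp += mvn.logpdf(ys[t - 1], H @ zs[t], Sigma_y)   # emission term  N(y_t | H z_t, Sigma_y)
    return lp
```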
### Prediction Step

$$
p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1}) =
\int\mathcal{N}(\boldsymbol{z}_t|\boldsymbol{f}(\boldsymbol{z}_{t-1};\boldsymbol{\theta}),\boldsymbol{\Sigma_z})
p(\boldsymbol{z}_{t-1}|\boldsymbol{y}_{1:t-1})\,d\boldsymbol{z}_{t-1}
$$
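For instance, when the transition is linear, $\boldsymbol{f}(\boldsymbol{z};\boldsymbol{\theta}) = \mathbf{F}\boldsymbol{z}$, and the previous filtering distribution is Gaussian with moments $\boldsymbol{\mu}_{t-1|t-1}, \boldsymbol{\Sigma}_{t-1|t-1}$, this integral has the well-known closed form:

$$
\begin{aligned}
\boldsymbol{\mu}_{t|t-1} &= \mathbf{F}\boldsymbol{\mu}_{t-1|t-1} \\
\boldsymbol{\Sigma}_{t|t-1} &= \mathbf{F}\boldsymbol{\Sigma}_{t-1|t-1}\mathbf{F}^\top + \boldsymbol{\Sigma_z}
\end{aligned}
$$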
We can use a generic set of equations:

$$
\begin{aligned}
\text{Predict}: && &&
p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1}) &=
\mathcal{N}(\boldsymbol{z}_t|\boldsymbol{\mu}_{t|t-1},\boldsymbol{\Sigma}_{t|t-1})
\end{aligned}
$$

### Correction Step

$$
p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t};\boldsymbol{\theta}) =
\frac{1}{\boldsymbol{E}(\boldsymbol{\theta})}
\mathcal{N}(\boldsymbol{y}_t|\boldsymbol{h}(\boldsymbol{z}_t;\boldsymbol{\theta}),\boldsymbol{\Sigma_y})
p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1})
$$

For Gauss-Markov models, we can write a generic set of equations to calculate the analysis step.
$$
\begin{aligned}
\text{Corrected Mean}: && &&
\boldsymbol{\mu}_{t|t}^{\boldsymbol{z}} &=
\boldsymbol{\mu}_{t|t-1}^{\boldsymbol{z}} +
\boldsymbol{\Sigma}_{t|t-1}^{\boldsymbol{zy}}
\left(\boldsymbol{\Sigma}_{t|t-1}^{\boldsymbol{y}} \right)^{-1}
(\boldsymbol{y}_t - \boldsymbol{\mu}_{t|t-1}^{\boldsymbol{y}}) \\
\text{Corrected Covariance}: && &&
\boldsymbol{\Sigma}_{t|t}^{\boldsymbol{z}} &=
\boldsymbol{\Sigma}_{t|t-1}^{\boldsymbol{z}} -
\boldsymbol{\Sigma}_{t|t-1}^{\boldsymbol{zy}}
\left(\boldsymbol{\Sigma}_{t|t-1}^{\boldsymbol{y}} \right)^{-1}
\boldsymbol{\Sigma}_{t|t-1}^{\boldsymbol{yz}}
\end{aligned}
$$

This is expressed in terms of predictive means and (cross-)covariances: $\boldsymbol{\mu}_{t|t-1}^{\boldsymbol{y}}$ and $\boldsymbol{\Sigma}_{t|t-1}^{\boldsymbol{y}}$ are the mean and covariance of the predicted measurement, and $\boldsymbol{\Sigma}_{t|t-1}^{\boldsymbol{zy}}$ is the state-measurement cross-covariance.
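To see how these abstract moment equations are used, here is a minimal sketch of one predict/correct cycle in the linear-Gaussian case, i.e., the classic Kalman filter; it also returns the per-step innovation likelihood that feeds the marginal likelihood discussed below. All names (`kalman_step`, `F`, `H`, ...) are illustrative.

```python
# Sketch of one predict + correct cycle for the linear-Gaussian SSM,
# written directly in terms of the generic moment equations above.
import numpy as np
from scipy.stats import multivariate_normal as mvn

def kalman_step(mu, Sigma, y_t, F, H, Sigma_z, Sigma_y):
    # Predict: p(z_t | y_{1:t-1}) = N(mu_pred, Sigma_pred)
    mu_pred = F @ mu
    Sigma_pred = F @ Sigma @ F.T + Sigma_z

    # Predicted measurement moments and cross-covariance
    mu_y = H @ mu_pred                              # mu^y_{t|t-1}
    S = H @ Sigma_pred @ H.T + Sigma_y              # Sigma^y_{t|t-1}
    Sigma_zy = Sigma_pred @ H.T                     # Sigma^{zy}_{t|t-1}

    # Correct: generic update with gain K = Sigma^{zy} (Sigma^y)^{-1}
    K = Sigma_zy @ np.linalg.inv(S)
    mu_filt = mu_pred + K @ (y_t - mu_y)
    Sigma_filt = Sigma_pred - K @ Sigma_zy.T

    # Per-step marginal likelihood contribution p(y_t | y_{1:t-1})
    log_ev = mvn.logpdf(y_t, mu_y, S)
    return mu_filt, Sigma_filt, log_ev
```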
This generic set of equations can be used to understand almost all filtering methods.
For example:
- Linear + Conjugate -> Linear Kalman Filter
- Linearization via Taylor Approximations -> Extended Kalman Filter (see the sketch below this list)
- Sigma Points -> Unscented Kalman Filter
- Cubature Points -> Cubature Kalman Filter
- Moment Matching -> Assumed Density Filter
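For instance, the extended Kalman filter obtains the required measurement moments via a first-order Taylor expansion of $\boldsymbol{h}$ about the predicted mean (a standard construction, shown here for reference):

$$
\begin{aligned}
\mathbf{H}_t &= \left.\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}}\right|_{\boldsymbol{z}=\boldsymbol{\mu}_{t|t-1}} \\
\boldsymbol{\mu}_{t|t-1}^{\boldsymbol{y}} &\approx \boldsymbol{h}(\boldsymbol{\mu}_{t|t-1};\boldsymbol{\theta}), \qquad
\boldsymbol{\Sigma}_{t|t-1}^{\boldsymbol{y}} \approx \mathbf{H}_t\boldsymbol{\Sigma}_{t|t-1}\mathbf{H}_t^\top + \boldsymbol{\Sigma_y}, \qquad
\boldsymbol{\Sigma}_{t|t-1}^{\boldsymbol{zy}} \approx \boldsymbol{\Sigma}_{t|t-1}\mathbf{H}_t^\top
\end{aligned}
$$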
### Marginal Likelihood

In the correction step, we see the normalization constant, $\boldsymbol{E}(\boldsymbol{\theta})$:

$$
\boldsymbol{E}(\boldsymbol{\theta}) =
p(\boldsymbol{y}_{t}|\boldsymbol{y}_{1:t-1};\boldsymbol{\theta}) =
\int \mathcal{N}(\boldsymbol{y}_t|\boldsymbol{h}(\boldsymbol{z}_t;\boldsymbol{\theta}),\boldsymbol{\Sigma_y})
p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1})\,d\boldsymbol{z}_t
$$

### Smoothing

$$
p(\boldsymbol{z}_t|\boldsymbol{y}_{1:T}) = \ldots
$$

## Exact Inference

There are very few circumstances in which we can perform exact inference, essentially only the linear-Gaussian and discrete-state SSMs.
### Observation Operator Encoder

$$
\begin{aligned}
\text{Forward Transform}: && &&
\boldsymbol{z}_t &= \boldsymbol{T}(\boldsymbol{y}_t;\boldsymbol{\theta}) \\
\text{Inverse Transform}: && &&
\boldsymbol{y}_t &= \boldsymbol{T}^{-1}(\boldsymbol{z}_t;\boldsymbol{\theta})
\end{aligned}
$$

This construction can be seen in de Bézenac et al. (2020).
The joint distribution will be:
$$
p(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T}) =
\mathcal{N}\left(\boldsymbol{z}_0|\boldsymbol{\mu}_0,\boldsymbol{\Sigma}_0\right)
\prod_{t=1}^T
\mathcal{N}\left(\boldsymbol{y}_t|\boldsymbol{T}^{-1}(\boldsymbol{z}_t;\boldsymbol{\theta}),\boldsymbol{\Sigma_y}\right)
\mathcal{N}\left(\boldsymbol{z}_t|\boldsymbol{f}(\boldsymbol{z}_{t-1};\boldsymbol{\theta}),\boldsymbol{\Sigma_z}\right)
$$

```python
# initial state prior
mu_0: Array["Dy"] = param("mu_0", init_value=...)
Sigma_0: Array["Dy Dy"] = param("Sigma_0", init_value=..., constraint=positive)
z0: Array["Dy"] = sample("z0", MultivariateNormal(mu_0, covariance_matrix=Sigma_0))

# transition prior (linear-Gaussian dynamics)
F: Array["Dy Dy"] = param("F", init_value=...)
b: Array["Dy"] = param("b", init_value=...)
mu_z = F @ z0 + b
Sigma_z: Array["Dy Dy"] = param("Sigma_z", init_value=..., constraint=positive)
z: Array["Dy"] = sample("z", MultivariateNormal(mu_z, covariance_matrix=Sigma_z))

# emission model: an invertible transform (normalizing flow) acts as the measurement operator
flow = NSF(features=y.shape, *args, **kwargs)
obs: Array["Dy"] = sample("obs", flow(z), obs=y)
```
## Parameter Estimation

The assumptions can be any combination of the following:

- We have an unknown dynamics model, $\boldsymbol{f}$, and an unknown measurement model, $\boldsymbol{h}$.
- We have unknown dynamics and measurement model parameters, $\boldsymbol{\theta}$.
- We have an unknown initial distribution, $p(\boldsymbol{z}_0|\boldsymbol{\theta})$.
- We have unknown Gaussian system and measurement noise, $\boldsymbol{\Sigma_z}$, $\boldsymbol{\Sigma_y}$.

In the first case of parameter estimation, we assume that we do not know the parameters, $\boldsymbol{\theta}$, and estimate them by maximizing the marginal log-likelihood of the observations:
$$
\boldsymbol{L}(\boldsymbol{\theta}) :=
\log p(\boldsymbol{y}_{1:T};\boldsymbol{\theta}) =
\sum_{t=1}^T\log p(\boldsymbol{y}_{t}|\boldsymbol{y}_{1:t-1};\boldsymbol{\theta})
$$

We can decompose this expression further:
$$
p(\boldsymbol{y}_{t}|\boldsymbol{y}_{1:t-1};\boldsymbol{\theta}) =
\int \mathcal{N}(\boldsymbol{y}_t|\boldsymbol{h}(\boldsymbol{z}_t),\boldsymbol{\Sigma_y})
p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t-1})\,d\boldsymbol{z}_t
$$

Unfortunately, analytical expressions for the filtering distribution, $p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t})$, and thus for the data log-likelihood, $\boldsymbol{L}(\boldsymbol{\theta})$, are only available for a small class of SSMs, such as the linear-Gaussian and discrete SSMs.
Thus we need to use approximate filtering distributions to estimate the log-likelihood.
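As an illustration of how this objective is typically set up, here is a hypothetical sketch that scores $\boldsymbol{\theta} = \{\mathbf{F}, \mathbf{H}, \boldsymbol{\Sigma_z}, \boldsymbol{\Sigma_y}\}$ by summing the per-step innovation log-likelihoods from a filter and then takes gradients with respect to $\boldsymbol{\theta}$. It assumes the linear-Gaussian case (so the filter, and hence $\boldsymbol{L}(\boldsymbol{\theta})$, is exact); all names are illustrative.

```python
# Hypothetical sketch: maximum-likelihood parameter estimation by summing the
# per-step terms log p(y_t | y_{1:t-1}; theta) produced by a (Kalman) filter.
import jax
import jax.numpy as jnp
from jax.scipy.stats import multivariate_normal as mvn

def log_marginal_likelihood(theta, ys, mu_0, Sigma_0):
    F, H = theta["F"], theta["H"]
    Sigma_z, Sigma_y = theta["Sigma_z"], theta["Sigma_y"]

    def step(carry, y_t):
        mu, Sigma = carry
        # predict
        mu_p, Sigma_p = F @ mu, F @ Sigma @ F.T + Sigma_z
        # predicted measurement moments
        mu_y, S = H @ mu_p, H @ Sigma_p @ H.T + Sigma_y
        # correct
        K = Sigma_p @ H.T @ jnp.linalg.inv(S)
        mu_f = mu_p + K @ (y_t - mu_y)
        Sigma_f = Sigma_p - K @ H @ Sigma_p
        return (mu_f, Sigma_f), mvn.logpdf(y_t, mu_y, S)

    _, log_evs = jax.lax.scan(step, (mu_0, Sigma_0), ys)
    return jnp.sum(log_evs)   # L(theta) = sum_t log p(y_t | y_{1:t-1}; theta)

# gradient of L(theta) for gradient-based optimization
# (a real setup would also constrain the covariances to stay positive definite)
grad_fn = jax.grad(log_marginal_likelihood)
```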
## State Estimation

**Assumptions**:

- The transition function, $\boldsymbol{f}$, and the measurement function, $\boldsymbol{h}$, are known.
- Gaussian system and measurement noise.

This is basically the setting in which we are able to filter.
- Deterministic Inference:
  - Approximate Model ($\boldsymbol{f}$, $\boldsymbol{h}$) - Linearization, Sigma Points
  - Approximate Integral
- Stochastic Inference:
  - Ensembles
  - Monte Carlo / Particle Filter (see the sketch below this list)
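As one stochastic example, here is a minimal sketch of a bootstrap particle filter: the prediction integral is replaced by sampling from the dynamics, and the correction is done by importance weighting against the measurement likelihood. A generic, possibly nonlinear `f` and `h` are assumed, and all names are illustrative.

```python
# Minimal bootstrap particle filter sketch (stochastic inference, illustrative names).
import numpy as np
from scipy.stats import multivariate_normal as mvn

def bootstrap_particle_filter(ys, f, h, Sigma_z, Sigma_y, mu_0, Sigma_0, n_particles, rng):
    # sample the initial ensemble from p(z_0)
    particles = rng.multivariate_normal(mu_0, Sigma_0, size=n_particles)
    means = []
    for y_t in ys:
        # prediction: propagate each particle through the dynamics plus system noise
        noise = rng.multivariate_normal(np.zeros(Sigma_z.shape[0]), Sigma_z, size=n_particles)
        particles = np.array([f(p) for p in particles]) + noise
        # correction: weight particles by the measurement likelihood
        log_w = np.array([mvn.logpdf(y_t, h(p), Sigma_y) for p in particles])
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # resample to avoid weight degeneracy
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
        means.append(particles.mean(axis=0))   # approximate filtering mean E[z_t | y_{1:t}]
    return np.stack(means)
```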