This document works through the Bayesian formulation for several of the estimation and learning problems we identified earlier.
## State & Observations

In this case, we are interested in the following quantities.
Joint Distribution
$$
p(\boldsymbol{z},\boldsymbol{y}) =
p(\boldsymbol{y}|\boldsymbol{z})\,p(\boldsymbol{z}) =
p(\boldsymbol{z}|\boldsymbol{y})\,p(\boldsymbol{y})
$$
Posterior

$$
p(\boldsymbol{z}|\boldsymbol{y}) =
\frac{p(\boldsymbol{y}|\boldsymbol{z})\,p(\boldsymbol{z})}{p(\boldsymbol{y})}
$$
Evidence

$$
p(\boldsymbol{y}) =
\int p(\boldsymbol{y}|\boldsymbol{z})\,p(\boldsymbol{z})\, d\boldsymbol{z}
$$
Posterior Predictive Distribution

$$
p(\boldsymbol{u}|\boldsymbol{y}) =
\int p(\boldsymbol{u}|\boldsymbol{z})\,
p(\boldsymbol{z}|\boldsymbol{y})\,
d\boldsymbol{z}
$$
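To make these quantities concrete, here is a minimal sketch assuming a one-dimensional conjugate Gaussian model (all names and values below are illustrative choices, not from the text), where the posterior, evidence, and posterior predictive are available in closed form.

```python
import numpy as np

# Minimal sketch, assuming a conjugate 1D Gaussian model:
#   prior:      z ~ N(mu0, s0^2)
#   likelihood: y | z ~ N(z, sy^2)
mu0, s0 = 0.0, 1.0      # prior mean and std of the state z (hypothetical)
sy = 0.5                # observation noise std (hypothetical)
y = 1.3                 # a single observation (hypothetical)

# Posterior p(z | y): precision-weighted combination of prior and likelihood
post_var = 1.0 / (1.0 / s0**2 + 1.0 / sy**2)
post_mean = post_var * (mu0 / s0**2 + y / sy**2)

# Evidence p(y) = ∫ p(y|z) p(z) dz, which is Gaussian in y for this model
evid_var = s0**2 + sy**2
log_evidence = -0.5 * (np.log(2 * np.pi * evid_var) + (y - mu0)**2 / evid_var)

# Posterior predictive p(u | y) = ∫ p(u|z) p(z|y) dz, with u | z ~ N(z, sy^2)
pred_mean, pred_var = post_mean, post_var + sy**2

print(post_mean, post_var, log_evidence, pred_mean, pred_var)
```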
## Parameters & Data

Given some data $\mathcal{D} = \left\{ u_n, z_n\right\}_{n=1}^N$, we are interested in the following quantities.
Joint Distribution
$$
p(\boldsymbol{\theta},\mathcal{D}) =
p(\boldsymbol{\theta}|\mathcal{D})\,
p(\mathcal{D}) =
p(\mathcal{D}|\boldsymbol{\theta})\,
p(\boldsymbol{\theta})
$$
Posterior

$$
p(\boldsymbol{\theta}|\mathcal{D}) =
\frac{p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{p(\mathcal{D})}
$$
Evidence

$$
p(\mathcal{D}) =
\int p(\mathcal{D}|\boldsymbol{\theta})\,
p(\boldsymbol{\theta})\,
d\boldsymbol{\theta}
$$
Posterior Predictive Distribution

$$
p(\boldsymbol{u}|\boldsymbol{z},\mathcal{D}) =
\int p(\boldsymbol{u}|\boldsymbol{z},\boldsymbol{\theta})\,
p(\boldsymbol{\theta}|\mathcal{D})\,
d\boldsymbol{\theta}
$$
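As a concrete instance, here is a minimal sketch assuming a Bayesian linear regression model, $u_n = \boldsymbol{z}_n^\top\boldsymbol{\theta} + \varepsilon_n$, with a Gaussian prior on the weights and known noise (all names, data, and precisions are hypothetical). It computes the parameter posterior $p(\boldsymbol{\theta}|\mathcal{D})$, the log evidence $\log p(\mathcal{D})$, and the posterior predictive $p(u|\boldsymbol{z},\mathcal{D})$ in closed form.

```python
import numpy as np

# Minimal sketch: Bayesian linear regression with Gaussian prior and noise.
# Assumptions (not from the text): theta ~ N(0, alpha^-1 I), noise precision beta,
# and hypothetical toy data.
rng = np.random.default_rng(0)
N, D = 50, 3
Z = rng.normal(size=(N, D))                          # inputs z_n
theta_true = np.array([1.0, -2.0, 0.5])
u = Z @ theta_true + rng.normal(scale=0.3, size=N)   # targets u_n

alpha, beta = 1.0, 1.0 / 0.3**2                      # prior and noise precisions

# Posterior p(theta | D) = N(m, S)
S_inv = alpha * np.eye(D) + beta * Z.T @ Z
S = np.linalg.inv(S_inv)
m = beta * S @ Z.T @ u

# Log evidence p(D) for the linear-Gaussian model
log_evidence = (0.5 * D * np.log(alpha) + 0.5 * N * np.log(beta)
                - 0.5 * beta * np.sum((u - Z @ m) ** 2)
                - 0.5 * alpha * m @ m
                - 0.5 * np.log(np.linalg.det(S_inv))
                - 0.5 * N * np.log(2 * np.pi))

# Posterior predictive p(u* | z*, D) = N(z*^T m, 1/beta + z*^T S z*)
z_star = np.array([0.2, -0.1, 1.0])
pred_mean = z_star @ m
pred_var = 1.0 / beta + z_star @ S @ z_star
print(m, log_evidence, pred_mean, pred_var)
```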
## State, Parameters & Data

Joint Distribution
$$
p(\boldsymbol{z},\boldsymbol{\theta},\boldsymbol{y}) =
p(\boldsymbol{z},\boldsymbol{\theta}|\boldsymbol{y})\,
p(\boldsymbol{y}) =
p(\boldsymbol{y}|\boldsymbol{z},\boldsymbol{\theta})\,
p(\boldsymbol{z}|\boldsymbol{\theta})\,
p(\boldsymbol{\theta})
$$
Posterior

$$
p(\boldsymbol{z},\boldsymbol{\theta}|\boldsymbol{y}) =
\frac{1}{Z}\,
p(\boldsymbol{y}|\boldsymbol{z},\boldsymbol{\theta})\,
p(\boldsymbol{z}|\boldsymbol{\theta})\,
p(\boldsymbol{\theta})
$$
Evidence

$$
Z = p(\boldsymbol{y}) =
\int\int p(\boldsymbol{y}|\boldsymbol{z},\boldsymbol{\theta})\,
p(\boldsymbol{z}|\boldsymbol{\theta})\,
p(\boldsymbol{\theta})\,
d\boldsymbol{z}\,
d\boldsymbol{\theta}
$$
Posterior Predictive Distribution

$$
p(\boldsymbol{u}|\boldsymbol{y}) =
\int\int p(\boldsymbol{u}|\boldsymbol{z},\boldsymbol{\theta})\,
p(\boldsymbol{z},\boldsymbol{\theta}|\boldsymbol{y})\,
d\boldsymbol{z}\,
d\boldsymbol{\theta}
$$
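When the joint posterior over state and parameters has no closed form, it can still be approximated by Monte Carlo. Here is a minimal sketch, assuming a toy hierarchical Gaussian model (all distributions and values are illustrative), that uses self-normalized importance sampling with the prior as the proposal to approximate the posterior, the evidence $Z$, and the posterior predictive.

```python
import numpy as np

# Minimal sketch, assuming a toy hierarchical Gaussian model (hypothetical):
#   theta ~ N(0, 1)              parameter prior p(theta)
#   z | theta ~ N(theta, 1)      state prior p(z | theta)
#   y | z, theta ~ N(z, 0.5^2)   likelihood p(y | z, theta)
rng = np.random.default_rng(0)
y, sy = 0.8, 0.5
S = 100_000

theta = rng.normal(0.0, 1.0, size=S)          # theta ~ p(theta)
z = rng.normal(theta, 1.0)                    # z ~ p(z | theta)
lik = np.exp(-0.5 * ((y - z) / sy) ** 2) / (np.sqrt(2 * np.pi) * sy)

Z_hat = lik.mean()                            # evidence estimate of Z = p(y)
w = lik / lik.sum()                           # normalized importance weights

post_mean_z = np.sum(w * z)                   # E[z | y] under p(z, theta | y)
post_mean_theta = np.sum(w * theta)           # E[theta | y]

# Posterior predictive p(u | y): draw u | z, theta ~ N(z, sy^2) per weighted sample
u = rng.normal(z, sy)
pred_mean_u = np.sum(w * u)
print(Z_hat, post_mean_z, post_mean_theta, pred_mean_u)
```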
## State-Space Model and Observations

In this example, we have a state-space model and some observations.
We therefore write the joint distribution in terms of a temporal discretization.
$$
p(\boldsymbol{z},\boldsymbol{y}|\boldsymbol{\theta}) =
p(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T}|\boldsymbol{\theta})
$$

So our new Bayesian decomposition of the joint distribution is given by
$$
\begin{aligned}
p(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T}|\boldsymbol{\theta}) &=
p(\boldsymbol{y}_{1:T}|\boldsymbol{z}_{0:T},\boldsymbol{\theta})\,
p(\boldsymbol{z}_{0:T}|\boldsymbol{\theta}) \\
&=
p(\boldsymbol{z}_{0:T}|\boldsymbol{y}_{1:T},\boldsymbol{\theta})\,
p(\boldsymbol{y}_{1:T}|\boldsymbol{\theta})
\end{aligned}
$$

where the prior is a dynamical model over the states, the likelihood is a measurement model for the observations, and a normalization constant appears in the posterior.
Here, we make the following assumptions about these relationships:
$$
\begin{aligned}
\text{Prior}: && \boldsymbol{z}_0
&\sim p(\boldsymbol{z}_0|\boldsymbol{\theta}) \\
\text{Transition}: && \boldsymbol{z}_{t}
&\sim p(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{\theta}) \\
\text{Emission}: && \boldsymbol{y}_t
&\sim p(\boldsymbol{y}_t|\boldsymbol{z}_t,\boldsymbol{\theta})
\end{aligned}
$$

So we can decompose the distributions listed above as
$$
p(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T}|\boldsymbol{\theta}) =
p(\boldsymbol{z}_0|\boldsymbol{\theta})
\prod_{t=1}^T p(\boldsymbol{z}_t|\boldsymbol{z}_{t-1},\boldsymbol{\theta})
\prod_{t=1}^T p(\boldsymbol{y}_t|\boldsymbol{z}_t,\boldsymbol{\theta})
$$
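This factorization is exactly how one simulates from the model: sample the initial state, propagate it through the transition, and emit an observation at each step. A minimal sketch, assuming a 1D linear-Gaussian state-space model with hypothetical parameter values:

```python
import numpy as np

# Ancestral sampling from the factorized joint
# p(z_0|theta) prod_t p(z_t|z_{t-1},theta) prod_t p(y_t|z_t,theta),
# assuming a 1D linear-Gaussian model (all parameter values are illustrative).
rng = np.random.default_rng(0)
T = 100
a, q = 0.95, 0.1        # transition: z_t = a z_{t-1} + N(0, q^2)
h, r = 1.0, 0.5         # emission:   y_t = h z_t + N(0, r^2)
m0, s0 = 0.0, 1.0       # initial state prior

z = np.empty(T + 1)     # states z_0, ..., z_T
y = np.empty(T)         # observations y_1, ..., y_T
z[0] = rng.normal(m0, s0)                        # z_0 ~ p(z_0 | theta)
for t in range(1, T + 1):
    z[t] = a * z[t - 1] + rng.normal(0.0, q)     # z_t ~ p(z_t | z_{t-1}, theta)
    y[t - 1] = h * z[t] + rng.normal(0.0, r)     # y_t ~ p(y_t | z_t, theta)
```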
State Posterior

$$
\begin{aligned}
p(\boldsymbol{z}_{0:T}|\boldsymbol{y}_{1:T},\boldsymbol{\theta}) &=
\frac{p(\boldsymbol{y}_{1:T}|\boldsymbol{z}_{0:T},\boldsymbol{\theta})\,
p(\boldsymbol{z}_{0:T}|\boldsymbol{\theta})}{p(\boldsymbol{y}_{1:T}|\boldsymbol{\theta})}
\end{aligned}
$$
Evidence

$$
\begin{aligned}
p(\boldsymbol{y}_{1:T}|\boldsymbol{\theta}) &= \int
p(\boldsymbol{y}_{1:T}|\boldsymbol{z}_{0:T},\boldsymbol{\theta})\,
p(\boldsymbol{z}_{0:T}|\boldsymbol{\theta})\,
d\boldsymbol{z}_{0:T} \\
&= \int p(\boldsymbol{z}_0|\boldsymbol{\theta})
\prod_{t=1}^T p(\boldsymbol{z}_t|\boldsymbol{z}_{t-1},\boldsymbol{\theta})
\prod_{t=1}^T p(\boldsymbol{y}_t|\boldsymbol{z}_t,\boldsymbol{\theta})\,
d\boldsymbol{z}_{0:T}
\end{aligned}
$$
Predictive Posterior Distribution

$$
p(\boldsymbol{u}_{1:T}|\boldsymbol{y}_{1:T},\boldsymbol{\theta}) = \int
p(\boldsymbol{u}_{1:T}|\boldsymbol{z}_{0:T},\boldsymbol{\theta})\,
p(\boldsymbol{z}_{0:T}|\boldsymbol{y}_{1:T},\boldsymbol{\theta})\,
d\boldsymbol{z}_{0:T}
$$
Other Quantities

$$
\begin{aligned}
\text{Filtering Dist}: && p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t}) \\
\text{Predictive Dist}: && p(\boldsymbol{z}_{t+\tau}|\boldsymbol{y}_{1:t}) \\
\text{Smoothing Dist}: && p(\boldsymbol{z}_t|\boldsymbol{y}_{1:T})
\end{aligned}
$$
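For the linear-Gaussian model sketched above, the filtering distributions $p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t})$ and the evidence $p(\boldsymbol{y}_{1:T}|\boldsymbol{\theta})$ are available in closed form via the Kalman filter. A minimal sketch, using the same hypothetical 1D parameterization as before (not tied to any particular library):

```python
import numpy as np

def kalman_filter_1d(y, a, q, h, r, m0, s0):
    """Filtering distributions p(z_t | y_{1:t}) and log-evidence log p(y_{1:T} | theta)
    for a 1D linear-Gaussian state-space model (illustrative parameterization)."""
    m, P = m0, s0**2          # moments of p(z_0 | theta)
    means, variances, log_evidence = [], [], 0.0
    for yt in y:
        # Predict: p(z_t | y_{1:t-1}) from the transition model
        m_pred = a * m
        P_pred = a * P * a + q**2
        # Innovation: p(y_t | y_{1:t-1}), accumulated into the evidence
        v = yt - h * m_pred
        S = h * P_pred * h + r**2
        log_evidence += -0.5 * (np.log(2 * np.pi * S) + v**2 / S)
        # Update: p(z_t | y_{1:t})
        K = P_pred * h / S
        m = m_pred + K * v
        P = (1.0 - K * h) * P_pred
        means.append(m)
        variances.append(P)
    return np.array(means), np.array(variances), log_evidence
```

Run on the observations `y` simulated in the earlier sketch, `means[t-1]` and `variances[t-1]` parameterize the Gaussian filtering distribution $p(\boldsymbol{z}_t|\boldsymbol{y}_{1:t})$, and `log_evidence` is the log of the evidence term $p(\boldsymbol{y}_{1:T}|\boldsymbol{\theta})$ defined above.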
## Parameters, State-Space Model, & Observations

In this example, we have a state-space model with unknown parameters and some observations.
We again write the joint distribution in terms of a temporal discretization.
$$
p(\boldsymbol{z},\boldsymbol{y},\boldsymbol{\theta}) =
p(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T},\boldsymbol{\theta})
$$

So our new Bayesian decomposition of the joint distribution is given by
$$
\begin{aligned}
p(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T},\boldsymbol{\theta}) &=
p(\boldsymbol{y}_{1:T}|\boldsymbol{z}_{0:T},\boldsymbol{\theta})\,
p(\boldsymbol{z}_{0:T}|\boldsymbol{\theta})\,
p(\boldsymbol{\theta}) \\
&=
p(\boldsymbol{z}_{0:T},\boldsymbol{\theta}|\boldsymbol{y}_{1:T})\,
p(\boldsymbol{y}_{1:T})
\end{aligned}
$$

where the prior now combines a distribution over the parameters with a dynamical model over the states, the likelihood is a measurement model for the observations, and a normalization constant appears in the posterior.
Here, we make the following assumptions about these relationships:
$$
\begin{aligned}
\text{Prior Parameters}: && \boldsymbol{\theta}
&\sim p(\boldsymbol{\theta}) \\
\text{Prior State}: && \boldsymbol{z}_0
&\sim p(\boldsymbol{z}_0|\boldsymbol{\theta}) \\
\text{Transition}: && \boldsymbol{z}_{t}
&\sim p(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{\theta}) \\
\text{Emission}: && \boldsymbol{y}_t
&\sim p(\boldsymbol{y}_t|\boldsymbol{z}_t,\boldsymbol{\theta})
\end{aligned}
$$

So we can decompose the distributions listed above as
$$
p(\boldsymbol{z}_{0:T},\boldsymbol{y}_{1:T},\boldsymbol{\theta}) =
p(\boldsymbol{\theta})\,p(\boldsymbol{z}_0|\boldsymbol{\theta})
\prod_{t=1}^T p(\boldsymbol{z}_t|\boldsymbol{z}_{t-1},\boldsymbol{\theta})
\prod_{t=1}^T p(\boldsymbol{y}_t|\boldsymbol{z}_t,\boldsymbol{\theta})
$$
Joint Posterior

$$
\begin{aligned}
p(\boldsymbol{z}_{0:T},\boldsymbol{\theta}|\boldsymbol{y}_{1:T}) &=
\frac{p(\boldsymbol{y}_{1:T}|\boldsymbol{z}_{0:T},\boldsymbol{\theta})\,
p(\boldsymbol{z}_{0:T}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{p(\boldsymbol{y}_{1:T})}
\end{aligned}
$$
Evidence

$$
\begin{aligned}
p(\boldsymbol{y}_{1:T}) &= \int\int
p(\boldsymbol{y}_{1:T}|\boldsymbol{z}_{0:T},\boldsymbol{\theta})\,
p(\boldsymbol{z}_{0:T}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,
d\boldsymbol{z}_{0:T}\,d\boldsymbol{\theta} \\
&= \int\int p(\boldsymbol{\theta})\,p(\boldsymbol{z}_0|\boldsymbol{\theta})
\prod_{t=1}^T p(\boldsymbol{z}_t|\boldsymbol{z}_{t-1},\boldsymbol{\theta})
\prod_{t=1}^T p(\boldsymbol{y}_t|\boldsymbol{z}_t,\boldsymbol{\theta})\,
d\boldsymbol{z}_{0:T}\,d\boldsymbol{\theta}
\end{aligned}
$$
Marginal Posterior (Parameters)

$$
p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T}) = \int p(\boldsymbol{z}_{0:T},\boldsymbol{\theta}|\boldsymbol{y}_{1:T})\,d\boldsymbol{z}_{0:T}
$$
Marginal Posterior (State)

$$
p(\boldsymbol{z}_{0:T}|\boldsymbol{y}_{1:T}) = \int p(\boldsymbol{z}_{0:T},\boldsymbol{\theta}|\boldsymbol{y}_{1:T})\,d\boldsymbol{\theta}
$$
Predictive Posterior Distribution (QoI)

$$
\begin{aligned}
\text{State \& Parameters}: &&
p(\boldsymbol{u}_{1:T}|\boldsymbol{y}_{1:T}) &= \int\int
p(\boldsymbol{u}_{1:T}|\boldsymbol{z}_{0:T},\boldsymbol{\theta})\,
p(\boldsymbol{z}_{0:T},\boldsymbol{\theta}|\boldsymbol{y}_{1:T})\,
d\boldsymbol{z}_{0:T}\,d\boldsymbol{\theta} \\
\text{State}: &&
p(\boldsymbol{u}_{1:T}|\boldsymbol{y}_{1:T}) &= \int
p(\boldsymbol{u}_{1:T}|\boldsymbol{z}_{0:T})\,
p(\boldsymbol{z}_{0:T}|\boldsymbol{y}_{1:T})\,
d\boldsymbol{z}_{0:T} \\
\text{Parameters}: &&
p(\boldsymbol{u}_{1:T}|\boldsymbol{y}_{1:T}) &= \int
p(\boldsymbol{u}_{1:T}|\boldsymbol{\theta})\,
p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T})\,
d\boldsymbol{\theta}
\end{aligned}
$$
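Putting the pieces together, the marginal posterior over the parameters satisfies $p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T}) \propto p(\boldsymbol{y}_{1:T}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})$, so it can be approximated by evaluating the state-space evidence over a grid of parameter values. A minimal sketch, assuming the same hypothetical 1D linear-Gaussian model and treating only the transition coefficient $a$ as unknown (all values are illustrative):

```python
import numpy as np

def log_evidence_1d(y, a, q=0.1, h=1.0, r=0.5, m0=0.0, s0=1.0):
    """log p(y_{1:T} | theta) for a 1D linear-Gaussian SSM via the Kalman filter
    prediction-error decomposition (illustrative parameterization)."""
    m, P, logZ = m0, s0**2, 0.0
    for yt in y:
        m_pred, P_pred = a * m, a * P * a + q**2
        v, S = yt - h * m_pred, h * P_pred * h + r**2
        logZ += -0.5 * (np.log(2 * np.pi * S) + v**2 / S)
        K = P_pred * h / S
        m, P = m_pred + K * v, (1.0 - K * h) * P_pred
    return logZ

# Simulate data with a "true" transition coefficient (hypothetical values).
rng = np.random.default_rng(0)
T, a_true, q, h, r = 200, 0.9, 0.1, 1.0, 0.5
z = rng.normal(0.0, 1.0)
y = np.empty(T)
for t in range(T):
    z = a_true * z + rng.normal(0.0, q)
    y[t] = h * z + rng.normal(0.0, r)

# Grid approximation of p(a | y_{1:T}) ∝ p(y_{1:T} | a) p(a), with a uniform prior on a.
grid = np.linspace(0.5, 1.0, 101)
log_post = np.array([log_evidence_1d(y, a) for a in grid])   # + log p(a), a constant here
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, grid)
print("posterior mean of a:", np.trapz(grid * post, grid))
```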