# Context

## Modeling

Imagine we have some data
$$
\begin{aligned}
\mathcal{D} = \left\{ \boldsymbol{x}_n,\boldsymbol{y}_n\right\}_{n=1}^N =
\left\{ \boldsymbol{X},\boldsymbol{Y}\right\}
\end{aligned}
$$

We are interested in finding the joint distribution which maps the covariates, $\boldsymbol{x}$, to the observations, $\boldsymbol{y}$.
We can decompose the joint distribution as follows:

$$
\begin{aligned}
\text{Joint Distribution}: && &&
p(\boldsymbol{x}_{1:N},\boldsymbol{y}_{1:N},\boldsymbol{\theta}) &=
p(\boldsymbol{\theta})\prod_{n=1}^N p(\boldsymbol{y}_n|\boldsymbol{x}_n,\boldsymbol{\theta})
\end{aligned}
$$

This can be done by finding the posterior over the parameters:
$$
\begin{aligned}
\text{Posterior}: && &&
p(\boldsymbol{\theta}|\mathcal{D}) &= \frac{1}{Z}p(\boldsymbol{Y}|\boldsymbol{X},\boldsymbol{\theta})p(\boldsymbol{\theta})
\end{aligned}
$$

There are a myriad of methods for finding the parameters.
For example, we could use conjugate methods if the functions are linear, or we could use approximate inference or sampling. Irrespective of the method, we end up with a set of parameters that we believe describe the model well.
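For instance, if the likelihood is linear-Gaussian and the prior over the weights is Gaussian, the posterior is available in closed form via the standard conjugate update. A minimal NumPy sketch, assuming a known noise variance; all values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy data from a linear model with known noise (illustrative values)
N, D = 100, 2
X = rng.normal(size=(N, D))
true_theta = np.array([1.5, -0.5])
noise_std = 0.1
y = X @ true_theta + noise_std * rng.normal(size=N)

# Gaussian prior p(theta) = N(0, alpha^{-1} I), known noise variance sigma2
alpha = 1.0
sigma2 = noise_std**2

# standard conjugate update: posterior precision, covariance, and mean
post_prec = alpha * np.eye(D) + X.T @ X / sigma2
post_cov = np.linalg.inv(post_prec)
post_mean = post_cov @ (X.T @ y) / sigma2

print(post_mean)  # close to true_theta
```

With enough data the posterior mean concentrates on the weights that generated the data, which is a quick check that the update is implemented correctly.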
## Post-Modeling Analysis

Assuming we have found the parameters, we are interested in performing sensitivity analysis: we take a distribution of interest over our input covariates, $\boldsymbol{x}$, and perform some analysis on the generated outputs.
$$
\begin{aligned}
\text{Covariates}: && &&
\boldsymbol{x}^*_n &\sim p(\boldsymbol{x}^*) \\
\text{Posterior Parameters}: && &&
\boldsymbol{\theta}_n &\sim p(\boldsymbol{\theta}|\mathcal{D}) \\
\text{Data Likelihood}: && &&
\boldsymbol{y}_n &\sim p(\boldsymbol{y}_n|\boldsymbol{x}^*_n,\boldsymbol{\theta}_n)
\end{aligned}
$$

So our new problem is to find some expectation over the data likelihood given our model parameters:
$$
\begin{aligned}
\text{Sensitivity Analysis}: && &&
\mathbb{E}_{\boldsymbol{x}^*}
\left[\mathbb{E}_{\boldsymbol{\theta}}
\left[
p(\boldsymbol{y}|\boldsymbol{x}^*,\boldsymbol{\theta})
\right]
\right] &=
\int_{\boldsymbol{x}^*}\int_{\boldsymbol{\theta}}
p(\boldsymbol{y}|\boldsymbol{x}^*,\boldsymbol{\theta})
p(\boldsymbol{\theta}|\mathcal{D})p(\boldsymbol{x}^*)
\,d\boldsymbol{\theta}\,d\boldsymbol{x}^*
\end{aligned}
$$

In general, there are two ways we can tackle this: 1) deterministic inference and 2) stochastic inference.
Deterministic inference is commonly known within the community as approximation methods. Stochastic inference is better known as Monte Carlo inference, or some variant thereof.
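The stochastic route amounts to ancestral sampling of the nested expectation: draw covariates from $p(\boldsymbol{x}^*)$, draw parameters from (an approximation to) $p(\boldsymbol{\theta}|\mathcal{D})$, push each pair through the likelihood, and average. A toy 1-D sketch where all the distributions are made-up stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
n_x, n_theta = 1000, 200

# stand-in distributions (illustrative only)
x_star = rng.normal(loc=2.0, scale=0.5, size=n_x)       # x* ~ p(x*)
theta = rng.normal(loc=1.0, scale=0.05, size=n_theta)   # theta ~ p(theta|D)
noise_std = 0.1

# y ~ p(y | x*, theta) for every (x*, theta) pair
y = np.outer(x_star, theta) + noise_std * rng.normal(size=(n_x, n_theta))

# Monte Carlo estimate of E_{x*}[ E_theta[ y ] ]
print(y.mean())  # approximately E[x*] * E[theta] = 2.0
```

The same loop structure carries over to any model: only the three sampling steps change.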
## Exact Inference

This is the case when we have linear methods and simple, conjugate distributions. If these two conditions are satisfied, then we can use a series of linear-algebra tricks and identities to construct a closed-form solution to this integral.
### Example

In this example, we can predict the mean and the covariance under a linear Gaussian model.
$$
\begin{aligned}
\text{Prior Distribution}: && &&
\boldsymbol{x} &\sim \mathcal{N}(\mathbf{x}\mid \mathbf{m},\mathbf{S}) \\
\text{Data Likelihood}: && &&
\boldsymbol{y} &\sim \mathcal{N}(\mathbf{y}\mid \boldsymbol{h}(\mathbf{x},\boldsymbol{\theta}), \boldsymbol{\Sigma}_\mathbf{y}) \\
\text{Linear Operator}: && &&
\boldsymbol{h}(\mathbf{x},\boldsymbol{\theta}) &= \mathbf{Wx} + \mathbf{b}
\end{aligned}
$$

**Parameters**
```python
# input covariates
m: Array["Dx"] = ...      # prior mean of covariate, x
S: Array["Dx Dx"] = ...   # prior covariance of covariate, x
# parameters
W: Array["Dy Dx"] = ...   # weight matrix
b: Array["Dy"] = ...      # bias vector
Q: Array["Dy Dy"] = ...   # observation covariance matrix
```
**Prediction Function**
$$
\begin{aligned}
p(\mathbf{y}) &= \int \mathcal{N}(\mathbf{x}\mid \mathbf{m},\mathbf{S})\,
\mathcal{N}(\mathbf{y}\mid \mathbf{Wx} + \mathbf{b}, \boldsymbol{\Sigma}_\mathbf{y})\,d\mathbf{x} \\
&= \mathcal{N}(\mathbf{y}\mid \mathbf{Wm} + \mathbf{b},\; \mathbf{W}\mathbf{S}\mathbf{W}^\top+\boldsymbol{\Sigma}_\mathbf{y})
\end{aligned}
$$

```python
y_mu: Array["Dy"] = W @ m + b
y_cov: Array["Dy Dy"] = W @ S @ W.T + Q
```
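Filling in concrete (made-up) numbers, the closed-form moments can be sanity-checked against a Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
Dx, Dy = 3, 2
m = np.array([1.0, -1.0, 0.5])    # prior mean
S = 0.2 * np.eye(Dx)              # prior covariance
W = rng.normal(size=(Dy, Dx))     # weight matrix
b = np.array([0.1, -0.2])         # bias vector
Q = 0.05 * np.eye(Dy)             # observation covariance

# closed-form marginal moments
y_mu = W @ m + b
y_cov = W @ S @ W.T + Q

# Monte Carlo check: sample x, push through Wx + b, add observation noise
n = 200_000
x = rng.multivariate_normal(m, S, size=n)
y = x @ W.T + b + rng.multivariate_normal(np.zeros(Dy), Q, size=n)

print(np.abs(y.mean(0) - y_mu).max())       # small
print(np.abs(np.cov(y.T) - y_cov).max())    # small
```

The empirical mean and covariance of the samples match the analytic moments up to Monte Carlo error, which is a useful regression test for the closed-form code.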
**Sample Function**
$$
\begin{aligned}
\mathbf{x}^{(n)} &\sim \mathcal{N}(\mathbf{x}\mid \mathbf{m},\mathbf{S}) \\
\boldsymbol{\mu}_y^{(n)} &= \boldsymbol{h}(\mathbf{x}^{(n)},\boldsymbol{\theta}) \\
\boldsymbol{y}^{(n)} &\sim \mathcal{N}(\boldsymbol{\mu}_y^{(n)}, \boldsymbol{\Sigma}_y)
\end{aligned}
$$

```python
# create covariate distribution
mvn_dist_x: Dist = MVN(m, S)
# sample covariates
n_samples: int = 100
seed: RNGKey = RNGKey(123)
x_samples: Array["Nx Dx"] = sample(dist=mvn_dist_x, seed=seed, shape=(n_samples,))
# push each sample through the linear operator, h(x, θ) = Wx + b
μ_y: Array["Nx Dy"] = einsum("nd,od->no", x_samples, W) + b
σ_y: Array["Dy Dy"] = Q
# create observation distribution (batched over the Nx sample means)
mvn_dist_y: Dist = BatchedMVN(μ_y, σ_y)
# sample observations
y_samples: Array["Ny Nx Dy"] = sample(dist=mvn_dist_y, seed=seed, shape=(n_samples,))
```
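The pseudocode above can be made runnable with plain NumPy. This sketch uses illustrative values and draws one observation per covariate sample rather than the full Ny × Nx grid:

```python
import numpy as np

rng = np.random.default_rng(4)
Dx, Dy, n_samples = 3, 2, 500

# illustrative parameter values
m = np.zeros(Dx)
S = np.eye(Dx)
W = rng.normal(size=(Dy, Dx))
b = np.zeros(Dy)
Q = 0.1 * np.eye(Dy)

# sample covariates: x^(n) ~ N(m, S)
x_samples = rng.multivariate_normal(m, S, size=n_samples)   # (Nx, Dx)

# per-sample observation means: mu_y^(n) = W x^(n) + b
mu_y = x_samples @ W.T + b                                  # (Nx, Dy)

# sample observations: y^(n) ~ N(mu_y^(n), Q)
y_samples = mu_y + rng.multivariate_normal(np.zeros(Dy), Q, size=n_samples)

print(x_samples.shape, mu_y.shape, y_samples.shape)
```

The shapes mirror the annotations in the pseudocode: `Nx` covariate samples in, `Nx` observation samples out.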
## Deterministic Inference

In general, we can perform deterministic inference using approximations to the integral. Some example methods include:
- Linearization, i.e., Taylor expansion
- Unscented transformation, i.e., sigma points
- Moment matching, i.e., Gauss-Hermite quadrature, Bayesian quadrature, etc.

### Example

$$
\begin{aligned}
\boldsymbol{y}_n &= \boldsymbol{h}(\boldsymbol{x}_n,\boldsymbol{\theta}) + \boldsymbol{\varepsilon}_n, && &&
\boldsymbol{\varepsilon}_n\sim\mathcal{N}(\mathbf{0},\mathbf{Q})
\end{aligned}
$$

## Stochastic Inference

Some example methods include:
- Sequential Monte Carlo
- Ensemble points

We take a Gaussian potential on a new observation with an arbitrary likelihood, given functions for the conditional moments, and make a Gaussian approximation. The equation is given as follows:
$$
\begin{aligned}
p(\boldsymbol{z}_t | \boldsymbol{y}_t, \boldsymbol{x}_t, \boldsymbol{y}_{1:t-1}, \boldsymbol{x}_{1:t-1})
&\propto
p(\boldsymbol{z}_t | \boldsymbol{y}_{1:t-1}, \boldsymbol{x}_{1:t-1})
p(\boldsymbol{y}_t | \boldsymbol{z}_t, \boldsymbol{x}_t) \\
&= \mathcal{N}(\boldsymbol{x}_t | \mathbf{m}, \mathbf{P})
\hspace{2mm}\mathbb{E}_{\boldsymbol{x}_t}\left[p(\boldsymbol{y}_t |\boldsymbol{z}_t, \boldsymbol{x}_t)\right]
\end{aligned}
$$
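The expectation term $\mathbb{E}_{\boldsymbol{x}_t}[\cdot]$ can be approximated by a plain Monte Carlo average over the Gaussian potential. A toy 1-D sketch that drops $\boldsymbol{z}_t$ and uses a Gaussian likelihood, so the expectation is also available exactly for comparison; all values are made up:

```python
import numpy as np

rng = np.random.default_rng(5)

def gauss_pdf(y, mu, std):
    # density of N(y | mu, std^2)
    return np.exp(-0.5 * ((y - mu) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

m, P = 0.0, 1.0            # moments of the Gaussian potential N(x_t | m, P)
y_t, obs_std = 0.5, 0.3    # observation and likelihood scale

# Monte Carlo: E_{x_t}[p(y_t | x_t)] ~ average likelihood over x_t ~ N(m, P)
x_t = rng.normal(m, np.sqrt(P), size=100_000)
mc_estimate = gauss_pdf(y_t, x_t, obs_std).mean()

# for this linear-Gaussian case the expectation is exact: N(y_t | m, P + obs_std^2)
exact = gauss_pdf(y_t, m, np.sqrt(P + obs_std**2))
print(mc_estimate, exact)  # approximately equal
```

For a non-Gaussian likelihood the exact line disappears, but the Monte Carlo average goes through unchanged; that is the appeal of the stochastic route.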