## Multivariate Gaussian

$$
\mathcal{N}(\boldsymbol{u}|\boldsymbol{\mu},\boldsymbol{\Sigma}) =
\frac{1}{(2\pi)^{D/2}}
\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}
\exp
\left[
-\frac{1}{2}
(\boldsymbol{u} - \boldsymbol{\mu})^\top
\boldsymbol{\Sigma}^{-1}
(\boldsymbol{u} - \boldsymbol{\mu})
\right]
$$

where:

- $\boldsymbol{u}\in\mathbb{R}^{D}$ - $D$-dimensional vector
- $\boldsymbol{\mu}\in\mathbb{R}^{D}$ - $D$-dimensional mean vector
- $\boldsymbol{\Sigma}\in\mathbb{R}^{D\times D}$ - $D\times D$ covariance matrix
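As a quick numerical check of the density above, here is a minimal NumPy/SciPy sketch; the particular values of `mu`, `Sigma`, and `u` are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

D = 3
mu = np.zeros(D)                      # mean vector, R^D
Sigma = np.array([[2.0, 0.5, 0.0],    # covariance matrix, R^{D x D}
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
u = np.array([0.2, -0.1, 0.4])

# direct evaluation of the formula above (in log space)
diff = u - mu
quad = diff @ np.linalg.solve(Sigma, diff)          # Mahalanobis quadratic term
log_norm = -0.5 * (D * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1])
log_pdf = log_norm - 0.5 * quad

# agrees with the library implementation
assert np.isclose(log_pdf, multivariate_normal(mu, Sigma).logpdf(u))
```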
## Mahalanobis Distance

We often call this the quadratic term.

$$
\begin{aligned}
\text{Mahalanobis Distance}: && &&
\boldsymbol{\Delta}^2 &=
(\boldsymbol{u} - \boldsymbol{\mu})^\top
\boldsymbol{\Sigma}^{-1}
(\boldsymbol{u} - \boldsymbol{\mu}) \\
&& &&
\boldsymbol{\Delta} &=
\sqrt{(\boldsymbol{u} - \boldsymbol{\mu})^\top
\boldsymbol{\Sigma}^{-1}
(\boldsymbol{u} - \boldsymbol{\mu})} \\
&& &&
\boldsymbol{\Delta}^2 &=
\text{tr}\left[\boldsymbol{\Sigma}^{-1}
(\boldsymbol{u} - \boldsymbol{\mu})
(\boldsymbol{u} - \boldsymbol{\mu})^\top\right]
\end{aligned}
$$

### Case I: Identity

When $\boldsymbol{\Sigma}=\mathbf{I}$, this reduces to the (squared) Euclidean distance.

$$
\text{Euclidean Distance}: \hspace{2mm}
(\boldsymbol{u} - \boldsymbol{\mu})^\top
(\boldsymbol{u} - \boldsymbol{\mu})
$$

### Case II: Scalar

When $\boldsymbol{\Sigma}=\sigma\mathbf{I}$, we obtain a scaled Euclidean distance.

$$
\text{Scaled Euclidean Distance}: \hspace{2mm}
\sigma^{-1}
(\boldsymbol{u} - \boldsymbol{\mu})^\top
(\boldsymbol{u} - \boldsymbol{\mu})
$$

### Case III: Diagonal

When $\boldsymbol{\Sigma}=\text{diag}(\boldsymbol{\sigma})$, each dimension is weighted by its own variance.

$$
\text{Weighted Euclidean Distance}: \hspace{2mm}
(\boldsymbol{u} - \boldsymbol{\mu})^\top
\text{diag}(\boldsymbol{\sigma})^{-1}
(\boldsymbol{u} - \boldsymbol{\mu})
$$
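A small sketch of the three special cases, treating `sigma2` and `sigma2_diag` as placeholder variances; the diagonal case is checked against the full quadratic form.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
u = rng.normal(size=D)
mu = np.zeros(D)
diff = u - mu

# Case I: identity covariance -> squared Euclidean distance
d2_identity = diff @ diff

# Case II: scalar covariance Sigma = sigma2 * I (sigma2 is the variance)
sigma2 = 0.5
d2_scalar = diff @ diff / sigma2

# Case III: diagonal covariance Sigma = diag(sigma2_diag)
sigma2_diag = np.array([0.5, 1.0, 2.0, 0.25])
d2_diag = np.sum(diff**2 / sigma2_diag)

# sanity check against the full Mahalanobis computation
Sigma = np.diag(sigma2_diag)
assert np.isclose(d2_diag, diff @ np.linalg.solve(Sigma, diff))
```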
### Case IV: Decomposition

### Case V: Full Covariance

## Masked Likelihood
## Conditional Gaussian Distributions

We have the joint distribution for the latent variables, $\boldsymbol{z}$, and a QoI, $\boldsymbol{u}$.

$$
\begin{bmatrix}
\boldsymbol{z} \\
\boldsymbol{u}
\end{bmatrix}
\sim \mathcal{N}
\left(
\begin{bmatrix}
\boldsymbol{z} \\
\boldsymbol{u}
\end{bmatrix}
\mid
\begin{bmatrix}
\bar{\boldsymbol{z}} \\
\bar{\boldsymbol{u}}
\end{bmatrix},
\begin{bmatrix}
\boldsymbol{\Sigma_{zz}} & \boldsymbol{\Sigma_{zu}}\\
\boldsymbol{\Sigma_{uz}} & \boldsymbol{\Sigma_{uu}}
\end{bmatrix}
\right)
$$
### Marginal Distributions

We have the marginal distribution for the variable $\boldsymbol{z}$:

$$
\begin{aligned}
p(\boldsymbol{z}) &=
\mathcal{N}
\left(
\boldsymbol{z} \mid
\boldsymbol{\bar{z}},
\boldsymbol{\Sigma_{zz}}
\right)
\end{aligned}
$$

We have the marginal distribution for the variable $\boldsymbol{u}$:
$$
\begin{aligned}
p(\boldsymbol{u}) &=
\mathcal{N}
\left(
\boldsymbol{u} \mid
\bar{\boldsymbol{u}},
\boldsymbol{\Sigma_{uu}}
\right)
\end{aligned}
$$

### Conditional Distributions

We have the conditional distribution for the variable $\boldsymbol{z}$:
$$
\begin{aligned}
p(\boldsymbol{z}|\boldsymbol{u}) &=
\mathcal{N}
\left(
\boldsymbol{z} \mid
\boldsymbol{\mu_{z|u}},
\boldsymbol{\Sigma_{z|u}}
\right) \\
\boldsymbol{\mu_{z|u}} &=
\bar{\boldsymbol{z}} +
\boldsymbol{\Sigma_{zu}}\boldsymbol{\Sigma_{uu}}^{-1}
(\boldsymbol{u} - \bar{\boldsymbol{u}}) \\
\boldsymbol{\Sigma_{z|u}} &=
\boldsymbol{\Sigma_{zz}} -
\boldsymbol{\Sigma_{zu}}
\boldsymbol{\Sigma_{uu}}^{-1}
\boldsymbol{\Sigma_{uz}}
\end{aligned}
$$

We have the conditional distribution for the variable $\boldsymbol{u}$:
$$
\begin{aligned}
p(\boldsymbol{u}|\boldsymbol{z}) &=
\mathcal{N}
\left(
\boldsymbol{u} \mid
\boldsymbol{\mu_{u|z}},
\boldsymbol{\Sigma_{u|z}}
\right) \\
\boldsymbol{\mu_{u|z}} &=
\bar{\boldsymbol{u}} +
\boldsymbol{\Sigma_{uz}}\boldsymbol{\Sigma_{zz}}^{-1}
(\boldsymbol{z} - \bar{\boldsymbol{z}}) \\
\boldsymbol{\Sigma_{u|z}} &=
\boldsymbol{\Sigma_{uu}} -
\boldsymbol{\Sigma_{uz}}
\boldsymbol{\Sigma_{zz}}^{-1}
\boldsymbol{\Sigma_{zu}}
\end{aligned}
$$
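A minimal sketch of these conditioning formulas; the block means and covariances below are placeholder values, not taken from any particular model.

```python
import numpy as np

# placeholder joint moments for (z, u)
z_bar, u_bar = np.zeros(2), np.ones(2)
Szz = np.array([[1.0, 0.3], [0.3, 1.5]])
Suu = np.array([[2.0, 0.4], [0.4, 1.0]])
Szu = np.array([[0.5, 0.2], [0.1, 0.3]])

u_obs = np.array([1.2, 0.7])

# p(z | u): gain = Sigma_zu Sigma_uu^{-1}
gain = np.linalg.solve(Suu, Szu.T).T
mu_z_given_u = z_bar + gain @ (u_obs - u_bar)
Sigma_z_given_u = Szz - gain @ Szu.T
```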
## Scaling

## Matrix Inversions

The primary thing we want to do when working with these expressions is invert the covariance matrix, which scales cubically with the dimension.
### Cholesky Decomposition

We can factor the matrix with a Cholesky decomposition, where $\mathbf{L}$ is a lower (or upper) triangular matrix.

$$
\mathbf{C} = \mathbf{LL}^\top
$$

In principle we could form the inverse explicitly,

$$
\mathbf{L}^{-1} = \text{Inverse}(\mathbf{L})
$$

but it is easier (and more stable) to work with the matrix solve

$$
\mathbf{x} = \mathbf{L}^{-1}\mathbf{b}.
$$

For this, we need a specialised solver.
```python
A: Array["D D"] = ...
b: Array["D M"] = ...
I: Array["D D"] = eye_like(A)

# Cholesky decomposition, A = L L^T (L is lower triangular)
L: Array["D D"] = cholesky(A, lower=True)
# explicit inverse via the Cholesky factor (rarely needed in practice)
A_inv: Array["D D"] = cho_solve((L, True), I)
# solve A x = b directly, reusing the factor
x: Array["D M"] = cho_solve((L, True), b)
```
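For reference, the same pattern with SciPy routines; this is a self-contained sketch with a random positive-definite matrix standing in for the covariance.

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve, solve_triangular

rng = np.random.default_rng(0)
D, M = 5, 2
X = rng.normal(size=(D, D))
A = X @ X.T + D * np.eye(D)        # symmetric positive-definite matrix
b = rng.normal(size=(D, M))

L = cholesky(A, lower=True)                 # A = L L^T
y = solve_triangular(L, b, lower=True)      # y = L^{-1} b (triangular solve)
x = cho_solve((L, True), b)                 # x = A^{-1} b, reusing the factor

assert np.allclose(A @ x, b)
```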
### Conjugate Gradient

$$
\mathbf{x}^* = \underset{\mathbf{x}}{\text{argmin}} \hspace{2mm}
\frac{1}{2}\mathbf{x}^\top\mathbf{A}\mathbf{x} - \mathbf{b}^\top\mathbf{x}
$$

Minimising this quadratic is equivalent to solving $\mathbf{Ax}=\mathbf{b}$ iteratively, without ever forming $\mathbf{A}^{-1}$.
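A minimal sketch using SciPy's conjugate-gradient solver on a synthetic positive-definite system; the matrix here is a placeholder, and in practice `A` might only be available as a matrix-vector product.

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
D = 100
X = rng.normal(size=(D, D))
A = X @ X.T + D * np.eye(D)     # symmetric positive-definite system
b = rng.normal(size=D)

# iteratively minimise (1/2) x^T A x - b^T x, i.e. solve A x = b
x, info = cg(A, b)
assert info == 0
assert np.linalg.norm(A @ x - b) / np.linalg.norm(b) < 1e-4
```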
### Woodbury Approximation

We can find some lower-dimensional subspace. For example, we can use the SVD decomposition

$$
\mathbf{C} \approx \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^\top + \sigma\mathbf{I}
$$

Looking at equation (34), we can take the inverse.
$$
\mathbf{C}^{-1} \approx
\sigma^{-1}\mathbf{I} - \sigma^{-2}\mathbf{U}
\left(\boldsymbol{\Lambda}^{-1} + \sigma^{-1}\mathbf{V}^\top\mathbf{U}\right)^{-1}\mathbf{V}^\top
$$
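A numerical sketch of this inversion for a symmetric low-rank-plus-noise matrix (taking $\mathbf{V}=\mathbf{U}$ so that $\mathbf{C}$ is symmetric); the sizes and values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 200, 10
U = rng.normal(size=(D, d))
Lam = np.diag(rng.uniform(0.5, 2.0, size=d))
sigma = 0.1

C = U @ Lam @ U.T + sigma * np.eye(D)

# Woodbury: only a d x d system is inverted instead of a D x D one
inner = np.linalg.inv(Lam) + (U.T @ U) / sigma
C_inv = np.eye(D) / sigma - (U / sigma) @ np.linalg.solve(inner, U.T / sigma)

assert np.allclose(C_inv, np.linalg.inv(C))
```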
### Inducing Points

We can use a subset of the points and calculate the covariance.

$$
\mathbf{C_{yy}} \approx \mathbf{C_{yr}C_{rr}}^{-1}\mathbf{C_{yr}}^\top + \mathbf{I}
$$

Now, applying the Woodbury identity, we can cheaply find the inverse
$$
\mathbf{C_{yy}}^{-1} \approx
\mathbf{I} -
\mathbf{C_{yr}}\left(\mathbf{C_{rr}} + \mathbf{C_{yr}}^\top\mathbf{C_{yr}}\right)^{-1}\mathbf{C_{yr}}^\top
$$
## Approximate Conditional Distributions

$$
p(\boldsymbol{z},\boldsymbol{u})
= \mathcal{N}
\left(
\begin{bmatrix}
\boldsymbol{z}\\
\boldsymbol{u}
\end{bmatrix}
\mid
\begin{bmatrix}
\boldsymbol{\hat{m}_z} \\
\boldsymbol{\hat{m}_u}
\end{bmatrix},
\begin{bmatrix}
\boldsymbol{\hat{C}_{zz}} & \boldsymbol{\hat{C}_{zu}}\\
\boldsymbol{\hat{C}_{uz}} & \boldsymbol{\hat{C}_{uu}}
\end{bmatrix}
\right)
$$

We have each of the terms as
$$
\begin{aligned}
\text{Mean}: && &&
\boldsymbol{\hat{m}_z} &=
\mathbb{E}\left[\boldsymbol{z}|\mathbf{Y}\right] =
\int\boldsymbol{f}(\boldsymbol{z})p(\boldsymbol{z})d\boldsymbol{z}\\
\text{Marginal Covariance}: && &&
\boldsymbol{\hat{C}_{zz}} &=
\text{Cov}\left[\boldsymbol{z}\right] =
\int\left(\boldsymbol{f}(\boldsymbol{z}) - \boldsymbol{\hat{m}_z}\right)
\left(\boldsymbol{f}(\boldsymbol{z}) - \boldsymbol{\hat{m}_z}\right)^\top
p(\boldsymbol{z})d\boldsymbol{z}
\\
\text{Mean}: && &&
\boldsymbol{\hat{y}} &=
\mathbb{E}\left[\boldsymbol{y}|\mathbf{Y}\right] =
\int\boldsymbol{h}(\boldsymbol{z})p(\boldsymbol{z})d\boldsymbol{z}\\
\text{Marginal Covariance}: && &&
\boldsymbol{\hat{C}_{yy}} &=
\text{Cov}\left[\boldsymbol{y}\right] =
\int\left(\boldsymbol{h}(\boldsymbol{z}) - \boldsymbol{\hat{y}}\right)
\left(\boldsymbol{h}(\boldsymbol{z}) - \boldsymbol{\hat{y}}\right)^\top
p(\boldsymbol{z})d\boldsymbol{z}
\\
\text{Cross-Covariance}: && &&
\boldsymbol{\hat{C}_{zy}} &=
\text{Cov}\left[\boldsymbol{z},\boldsymbol{y}\right] =
\int\left(\boldsymbol{f}(\boldsymbol{z}) - \boldsymbol{\hat{m}_z}\right)
\left(\boldsymbol{h}(\boldsymbol{z}) - \boldsymbol{\hat{y}}\right)^\top
p(\boldsymbol{z})d\boldsymbol{z}
\end{aligned}
$$
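These integrals generally have no closed form; one hedged option is to estimate them by Monte Carlo. In the sketch below, the prior $p(\boldsymbol{z})$ and the maps `f` and `h` are arbitrary placeholders chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# placeholder prior p(z) and maps f (latent) and h (observation)
D_z, D_y, N = 2, 3, 50_000
z_bar, Sigma_z = np.zeros(D_z), np.eye(D_z)
A = rng.normal(size=(D_z, D_z))
H = rng.normal(size=(D_y, D_z))
f = lambda z: np.tanh(z @ A.T)
h = lambda z: z @ H.T

# Monte Carlo estimates of the moment integrals
z = rng.multivariate_normal(z_bar, Sigma_z, size=N)
fz, hz = f(z), h(z)
m_z = fz.mean(axis=0)                              # \hat{m}_z
y_hat = hz.mean(axis=0)                            # \hat{y}
C_zz = np.cov(fz, rowvar=False)                    # \hat{C}_{zz}
C_yy = np.cov(hz, rowvar=False)                    # \hat{C}_{yy}
C_zy = (fz - m_z).T @ (hz - y_hat) / (N - 1)       # \hat{C}_{zy}
```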
We have the (approximate) conditional distribution for the variable $\boldsymbol{z}$:

$$
\begin{aligned}
p(\boldsymbol{z}|\boldsymbol{u}) &=
\mathcal{N}
\left(
\boldsymbol{z} \mid
\boldsymbol{\mu_{z|u}},
\boldsymbol{\Sigma_{z|u}}
\right) \\
\boldsymbol{\mu_{z|u}} &=
\boldsymbol{\hat{m}_z} +
\boldsymbol{\hat{C}_{zu}}\boldsymbol{\hat{C}_{uu}}^{-1}
(\boldsymbol{u} - \boldsymbol{\hat{m}_u}) \\
\boldsymbol{\Sigma_{z|u}} &=
\boldsymbol{\hat{C}_{zz}} -
\boldsymbol{\hat{C}_{zu}}
\boldsymbol{\hat{C}_{uu}}^{-1}
\boldsymbol{\hat{C}_{uz}}
\end{aligned}
$$

We have the (approximate) conditional distribution for the variable $\boldsymbol{u}$:
$$
\begin{aligned}
p(\boldsymbol{u}|\boldsymbol{z}) &=
\mathcal{N}
\left(
\boldsymbol{u} \mid
\boldsymbol{\mu_{u|z}},
\boldsymbol{\Sigma_{u|z}}
\right) \\
\boldsymbol{\mu_{u|z}} &=
\boldsymbol{\hat{m}_u} +
\boldsymbol{\hat{C}_{uz}}\boldsymbol{\hat{C}_{zz}}^{-1}
(\boldsymbol{z} - \boldsymbol{\hat{m}_z}) \\
\boldsymbol{\Sigma_{u|z}} &=
\boldsymbol{\hat{C}_{uu}} -
\boldsymbol{\hat{C}_{uz}}
\boldsymbol{\hat{C}_{zz}}^{-1}
\boldsymbol{\hat{C}_{zu}}
\end{aligned}
$$
## Linear Conditional Gaussian Model

We have a latent variable which is Gaussian distributed:

$$
p(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{z}\mid\boldsymbol{\bar{z}},\boldsymbol{\Sigma_z})
$$

We have a QoI which we believe is a linear transformation of the latent variable:
$$
p(\boldsymbol{u}|\boldsymbol{z}) = \mathcal{N}
\left(
\boldsymbol{u}\mid
\mathbf{A}\boldsymbol{z} + \mathbf{b},
\boldsymbol{\Sigma_u}
\right)
$$

Recall the joint distribution given in equation (6).
We can write each of the terms of that joint distribution as:

- $\boldsymbol{\Sigma_{zz}}=\boldsymbol{\Sigma_{z}}$
- $\boldsymbol{\Sigma_{uu}}=\mathbf{A}\boldsymbol{\Sigma_{z}}\mathbf{A}^\top + \boldsymbol{\Sigma_{u}}$
- $\boldsymbol{\Sigma_{uz}}=\mathbf{A}\boldsymbol{\Sigma_{z}}$
- $\boldsymbol{\bar{u}}=\mathbf{A}\boldsymbol{\bar{z}} + \mathbf{b}$
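A short sketch assembling these blocks; the particular values of `A`, `b`, `Sigma_z`, and `Sigma_u` are placeholders.

```python
import numpy as np

# placeholder prior and linear map
z_bar = np.zeros(2)
Sigma_z = np.array([[1.0, 0.2], [0.2, 0.5]])
A = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 2.0]])
b = np.ones(3)
Sigma_u = 0.1 * np.eye(3)

# blocks of the joint Gaussian over (z, u)
u_bar = A @ z_bar + b
Sigma_uu = A @ Sigma_z @ A.T + Sigma_u
Sigma_uz = A @ Sigma_z

joint_mean = np.concatenate([z_bar, u_bar])
joint_cov = np.block([[Sigma_z, Sigma_uz.T],
                      [Sigma_uz, Sigma_uu]])
```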
## Taylor Expansion

$$
\begin{bmatrix}
\mathbf{x} \\
y
\end{bmatrix}
\sim \mathcal{N} \left(
\begin{bmatrix}
\mu_\mathbf{x} \\
f(\mathbf{x})
\end{bmatrix},
\begin{bmatrix}
\Sigma_\mathbf{x} & C \\
C^\top & \Pi
\end{bmatrix}
\right)
$$

### Taylor Expansion
$$
\begin{aligned}
f(\mathbf{x}) &= f(\mu_x + \delta_x) \\
&\approx f(\mu_x) + \nabla_x f(\mu_x)\delta_x + \frac{1}{2}\sum_i \delta_x^\top \nabla_{xx}^{(i)}f(\mu_x)\delta_x e_i + \ldots
\end{aligned}
$$

### Joint Distribution

$$
\mathbb{E}_\mathbf{x}\left[ \tilde{f}(\mathbf{x}) \right], \hspace{2mm}
\mathbb{V}_\mathbf{x}\left[ \tilde{f}(\mathbf{x}) \right]
$$

### Mean Function
$$
\begin{aligned}
\mathbb{E}_\mathbf{x}\left[ \tilde{f}(\mathbf{x}) \right] &=
\mathbb{E}_\mathbf{x}\left[ \tilde{f}(\mu_\mathbf{x}) + \nabla_\mathbf{x}f(\mu_\mathbf{x})\epsilon_\mathbf{x} \right] \\
&= \mathbb{E}_\mathbf{x}\left[ \tilde{f}(\mu_\mathbf{x}) \right] +
\mathbb{E}_\mathbf{x}\left[ \nabla_\mathbf{x}f(\mu_\mathbf{x})\epsilon_\mathbf{x} \right] \\
&= \tilde{f}(\mu_\mathbf{x}) +
\nabla_\mathbf{x}f(\mu_\mathbf{x})\mathbb{E}_\mathbf{x}\left[ \epsilon_\mathbf{x} \right] \\
&= \tilde{f}(\mu_\mathbf{x})
\end{aligned}
$$

where the last line follows because $\mathbb{E}_\mathbf{x}\left[\epsilon_\mathbf{x}\right] = \boldsymbol{0}$.
## Sample & Population Moments

$$
\begin{aligned}
\text{Matrix}: && &&
\mathbf{Z} &=
\left[\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_N\right]^\top, && &&
\mathbf{Z}\in\mathbb{R}^{N\times D} \\
\text{Sample Mean}: && &&
\hat{\mathbf{z}} &=
\frac{1}{N}\sum_{n=1}^N \mathbf{z}_n, && &&
\hat{\mathbf{z}}\in\mathbb{R}^{D} \\
\text{Sample Variance}: && &&
\hat{\boldsymbol{\sigma}}_{\mathbf{z}} &=
\frac{1}{N-1}\sum_{n=1}^N
\left(\mathbf{z}_n - \hat{\mathbf{z}}\right)^2, && &&
\hat{\boldsymbol{\sigma}}_{\mathbf{z}}\in\mathbb{R}^{D} \\
\text{Sample Covariance}: && &&
\hat{\boldsymbol{\Sigma}}_{\mathbf{z}} &=
\frac{1}{N-1}\sum_{n=1}^N
\left(\mathbf{z}_n - \hat{\mathbf{z}}\right)
\left(\mathbf{z}_n - \hat{\mathbf{z}}\right)^\top, && &&
\hat{\boldsymbol{\Sigma}}_{\mathbf{z}}\in\mathbb{R}^{D\times D} \\
\text{Population Mean}: && &&
\hat{\boldsymbol{\mu}}_\mathbf{z} &=
\frac{1}{D}\sum_{d=1}^D \mathbf{z}_d, && &&
\hat{\boldsymbol{\mu}}_\mathbf{z}\in\mathbb{R}^{N} \\
\text{Population Variance}: && &&
\hat{\boldsymbol{\nu}}_{\mathbf{z}} &=
\frac{1}{D}\sum_{d=1}^D
\left(\mathbf{z}_d - \hat{\boldsymbol{\mu}}_\mathbf{z}\right)^2, && &&
\hat{\boldsymbol{\nu}}_{\mathbf{z}}\in\mathbb{R}^{N} \\
\text{Population Covariance}: && &&
\hat{\mathbf{K}}_{\mathbf{z}} &=
\frac{1}{D}\sum_{d=1}^D
\left(\mathbf{z}_d - \hat{\boldsymbol{\mu}}_\mathbf{z}\right)
\left(\mathbf{z}_d - \hat{\boldsymbol{\mu}}_\mathbf{z}\right)^\top, && &&
\hat{\mathbf{K}}_{\mathbf{z}}\in\mathbb{R}^{N\times N}
\end{aligned}
$$

**Examples**:
- Global Mean Surface Temperature, $x\in\mathbb{R}^{N\times D}$, $N=\text{Models}$, $D=\sum D_T D_\Omega$
- Spatial Scene, $x\in\mathbb{R}^{N\times D}$, $N=\text{Ensembles}$, $D=\text{Space}$
- Spatiotemporal Trajectory, $x\in\mathbb{R}^{N\times D}$, $N=\text{Space/Time}$, $D=\text{Time/Space}$
- Ensemble of Trajectories, $x\in\mathbb{R}^{N\times D}$, $N=\text{Ensembles}$, $D=\text{Time}\times\text{Space}$

## Gaussian Approximation Algorithm

### Moment Estimation

#### Samples

$$
\begin{aligned}
\text{Matrix}: && &&
\mathbf{Z} &=
\left[\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_N\right], && &&
\mathbf{Z}\in\mathbb{R}^{D\times N} \\
\text{Perturbation Matrix}: && &&
\mathbf{P} &= \mathbf{Z} - \hat{\mathbf{z}}, && &&
\mathbf{P}\in\mathbb{R}^{D \times N}
\end{aligned}
$$

We can do all of these operations in matrix form.
$$
\begin{aligned}
\text{Sample Mean}: && &&
\hat{\mathbf{z}} &= \frac{1}{N}\mathbf{Z}\cdot\mathbf{1}, && &&
\hat{\mathbf{z}}\in\mathbb{R}^{D} \\
\text{Perturbation Matrix}: && &&
\hat{\mathbf{P}} &= \mathbf{Z}\cdot\left(\mathbf{I}_N - \frac{1}{N}\mathbf{11}^\top\right), && &&
\hat{\mathbf{P}}\in\mathbb{R}^{D\times N} \\
\text{Sample Covariance}: && &&
\hat{\boldsymbol{\Sigma}}_{\mathbf{z}} &=
\frac{1}{N-1} \hat{\mathbf{P}}\hat{\mathbf{P}}^\top, && &&
\hat{\boldsymbol{\Sigma}}_{\mathbf{z}}\in\mathbb{R}^{D\times D}
\end{aligned}
$$

**Note**: the perturbation matrix in this form is equivalent to the kernel centering operation (see the scikit-learn docs).
It allows one to center the gram matrix without explicitly computing the mapping.
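A minimal sketch of the sample-moment computation in matrix form, using random placeholder data and checked against `np.cov`:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 500
Z = rng.normal(size=(D, N))                  # samples stored column-wise, Z in R^{D x N}

ones = np.ones((N, 1))
z_hat = (Z @ ones / N).ravel()               # sample mean, R^D
P = Z @ (np.eye(N) - ones @ ones.T / N)      # centred (perturbation) matrix, R^{D x N}
Sigma_hat = P @ P.T / (N - 1)                # sample covariance, R^{D x D}

assert np.allclose(Sigma_hat, np.cov(Z))     # np.cov treats rows as variables by default
```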
#### Population

$$
\begin{aligned}
\text{Matrix}: && &&
\mathbf{Z} &=
\left[\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_N\right]^\top, && &&
\mathbf{Z}\in\mathbb{R}^{N\times D} \\
\text{Perturbation Matrix}: && &&
\mathbf{P} &= \mathbf{Z} - \hat{\boldsymbol{\mu}}_\mathbf{z}, && &&
\mathbf{P}\in\mathbb{R}^{N \times D}
\end{aligned}
$$

We can do all of these operations in matrix form.
$$
\begin{aligned}
\text{Population Mean}: && &&
\hat{\boldsymbol{\mu}}_\mathbf{z} &=
\frac{1}{D}\mathbf{Z}\cdot\mathbf{1}, && &&
\hat{\boldsymbol{\mu}}_\mathbf{z}\in\mathbb{R}^{N} \\
\text{Perturbation Matrix}: && &&
\hat{\mathbf{P}} &= \mathbf{Z}\cdot\left(\mathbf{I}_D - \frac{1}{D}\mathbf{11}^\top\right), && &&
\hat{\mathbf{P}}\in\mathbb{R}^{N\times D} \\
\text{Population Covariance}: && &&
\hat{\mathbf{K}}_{\mathbf{z}} &=
\frac{1}{D} \hat{\mathbf{P}}\hat{\mathbf{P}}^\top, && &&
\hat{\mathbf{K}}_{\mathbf{z}}\in\mathbb{R}^{N\times N}
\end{aligned}
$$
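And the analogous sketch for the population (Gram-style) moments, again with placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 400
Z = rng.normal(size=(N, D))                  # samples stored row-wise, Z in R^{N x D}

ones = np.ones((D, 1))
mu_hat = (Z @ ones / D).ravel()              # population mean, R^N
P = Z @ (np.eye(D) - ones @ ones.T / D)      # centred matrix, R^{N x D}
K_hat = P @ P.T / D                          # N x N Gram-style covariance
```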
## Matrix Identities

$$
\left( \mathbf{A}+\mathbf{UCV}^\top\right)^{-1} =
\mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}
\left(\mathbf{C}^{-1} + \mathbf{V}^\top\mathbf{A}^{-1}\mathbf{U}\right)^{-1}
\mathbf{V}^{\top}\mathbf{A}^{-1}
$$

The following is basically the same as the Woodbury formula (34), except that the matrix $\mathbf{A}$ is the identity $\mathbf{I}_D$ and the inner matrix $\mathbf{C}$ is the identity $\mathbf{I}_d$:
$$
\left( \mathbf{I}+\mathbf{UV}^\top\right)^{-1} =
\mathbf{I}_D - \mathbf{U}
\left(\mathbf{I}_d + \mathbf{V}^\top\mathbf{U}\right)^{-1}
\mathbf{V}^{\top}
$$

There are some lemmas to this which have been shown to be very useful in practice.
$$
\begin{aligned}
\mathbf{AB^\top(C+BAB^\top)^{-1}} &=
\mathbf{(A^{-1}+B^\top C^{-1}B)^{-1}B^\top C^{-1}} \\
\mathbf{(C^{-1}+B^\top A^{-1}B)^{-1}} &=
\mathbf{C - CB^\top(BCB^\top+A)^{-1}BC}
\end{aligned}
$$
### Sylvester Determinant Lemma

$$
\begin{aligned}
\left| \mathbf{A}+\mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^\top\right| &=
\left|\mathbf{A}\right|
\left|\boldsymbol{\Lambda}\right|
\left|\boldsymbol{\Lambda}^{-1} + \mathbf{V}^\top\mathbf{A}^{-1}\mathbf{U}\right| \\
\left| \mathbf{A}+\mathbf{UV}^\top\right| &=
\left|\mathbf{A}\right|
\left|\mathbf{I}_d + \mathbf{V}^\top\mathbf{A}^{-1}\mathbf{U}\right|
\end{aligned}
$$

### Weinstein–Aronszajn Identity

$$
\left| \mathbf{I}_D+\mathbf{UV}^\top\right| =
\left|\mathbf{I}_d + \mathbf{V}^\top\mathbf{U}\right|
$$
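A quick numerical sanity check of the inversion and determinant identities on random matrices; the small sizes and scales are chosen purely so the dense comparisons stay cheap and well conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 20, 3
A = np.diag(rng.uniform(1.0, 2.0, size=D))
U = rng.normal(size=(D, d))
V = rng.normal(size=(D, d))
Lam = np.diag(rng.uniform(0.01, 0.02, size=d))   # small scale keeps A + U Lam V^T invertible

A_inv = np.linalg.inv(A)

# Woodbury matrix inversion lemma
lhs = np.linalg.inv(A + U @ Lam @ V.T)
rhs = A_inv - A_inv @ U @ np.linalg.inv(np.linalg.inv(Lam) + V.T @ A_inv @ U) @ V.T @ A_inv
assert np.allclose(lhs, rhs)

# Sylvester determinant lemma
lhs_det = np.linalg.det(A + U @ Lam @ V.T)
rhs_det = np.linalg.det(A) * np.linalg.det(Lam) * np.linalg.det(np.linalg.inv(Lam) + V.T @ A_inv @ U)
assert np.allclose(lhs_det, rhs_det)
```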
## Decompositions

### Eigenvalue Decomposition

$$
\mathbf{K} \approx \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^\top
$$

### Nyström Approximation

$$
\mathbf{K} \approx \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^\top
$$

#### Inversion

We can use the matrix inversion properties from equation (34) to decompose the Nyström approximation into a cheaper inversion.
$$
\begin{aligned}
\mathbf{K}^{-1} &=
\left( \mathbf{K} + \sigma^{2}\mathbf{I}_N\right)^{-1} \\
&\approx
\left( \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^\top + \sigma^{2}\mathbf{I}_N\right)^{-1} \\
&= \sigma^{-2}\mathbf{I}_N - \sigma^{-4}\mathbf{U}
\left( \boldsymbol{\Lambda}^{-1} + \sigma^{-2}\mathbf{V}^\top\mathbf{U}\right)^{-1}
\mathbf{V}^\top
\end{aligned}
$$

#### Determinant

We can use the determinant lemma from equation (37) to compute the determinant of the Nyström approximation cheaply.
$$
\begin{aligned}
\left|\mathbf{K}\right| &=
\left| \mathbf{K} + \sigma^{2}\mathbf{I}_N\right| \\
&\approx
\left| \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^\top + \sigma^{2}\mathbf{I}_N\right| \\
&=
\sigma^{2N}
\left|\boldsymbol{\Lambda}\right|
\left|\boldsymbol{\Lambda}^{-1} + \sigma^{-2}\mathbf{V}^\top\mathbf{U}\right|
\end{aligned}
$$
### Random Fourier Features

$$
\mathbf{K} \approx \mathbf{L}\mathbf{L}^\top
$$

#### Inversion

We can use the matrix inversion properties from equation (34) to turn the random feature approximation into a cheaper inversion.

$$
\begin{aligned}
\mathbf{K}^{-1} &=
\left( \mathbf{K} + \sigma^{2}\mathbf{I}_N\right)^{-1} \\
&\approx
\left( \mathbf{L}\mathbf{L}^\top + \sigma^{2}\mathbf{I}_N\right)^{-1} \\
&= \sigma^{-2}\mathbf{I}_N - \sigma^{-4}\mathbf{L}
\left( \mathbf{I}_d + \sigma^{-2}\mathbf{L}^\top\mathbf{L}\right)^{-1}
\mathbf{L}^\top
\end{aligned}
$$
#### Determinant

We can use the determinant lemma from equation (37) to compute the determinant of the random feature approximation cheaply.

$$
\begin{aligned}
\left|\mathbf{K}\right| &=
\left| \mathbf{K} + \sigma^{2}\mathbf{I}_N\right| \\
&\approx
\left| \mathbf{L}\mathbf{L}^\top + \sigma^{2}\mathbf{I}_N\right| \\
&=
\sigma^{2N}
\left|\mathbf{I}_d + \sigma^{-2}\mathbf{L}^\top\mathbf{L}\right|
\end{aligned}
$$
### Inducing Points

$$
\begin{aligned}
\text{Inducing Point Kernel}: && &&
\mathbf{K}_{\mathbf{uu}} &= \boldsymbol{K}(\mathbf{U}) \\
\text{Cross Kernel}: && &&
\mathbf{K}_{\mathbf{ux}} &= \boldsymbol{K}(\mathbf{U}, \mathbf{X}) \\
\text{Decomposition}: && &&
\mathbf{K_{uu}} &= \mathbf{L_{uu}L_{uu}}^\top \\
\text{Approximate Kernel}: && &&
\mathbf{K_{xx}} &\approx \mathbf{K_{xu}}\mathbf{K_{uu}}^{-1}\mathbf{K_{ux}}
\end{aligned}
$$

We can demonstrate the decomposition as:
K x x ≈ K x u K u u − 1 K u x = K x u ( L u u L u u ⊤ ) − 1 K u x = K x u ( L u u − 1 ) ( L u u − 1 ) ⊤ K u x = W W ⊤ W = ( L u u K u x ) ⊤ \begin{aligned}
\mathbf{K_{xx}}
&\approx \mathbf{K_{xu}}\mathbf{K_{uu}}^{-1}\mathbf{K_{ux}} \\
&= \mathbf{K_{xu}}\left(\mathbf{L_{uu}L_{uu}}^\top\right)^{-1}\mathbf{K_{ux}} \\
&= \mathbf{K_{xu}}\left(\mathbf{L_{uu}}^{-1}\right)\left(\mathbf{L_{uu}}^{-1}\right)^\top\mathbf{K_{ux}} \\
&= \mathbf{WW}^\top\\
\mathbf{W} &= \left( \mathbf{L_{uu}K_{ux}}\right)^\top
\end{aligned} K xx W ≈ K xu K uu − 1 K ux = K xu ( L uu L uu ⊤ ) − 1 K ux = K xu ( L uu − 1 ) ( L uu − 1 ) ⊤ K ux = WW ⊤ = ( L uu K ux ) ⊤
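A sketch of computing the factor $\mathbf{W}$ with a triangular solve; the RBF kernel, the inputs, and the jitter value here are placeholder assumptions.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf_kernel(X1, X2, lengthscale=1.0):
    """Squared-exponential kernel matrix (placeholder choice of kernel)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))        # training inputs
U = np.linspace(-3, 3, 20)[:, None]          # inducing inputs

K_uu = rbf_kernel(U, U) + 1e-6 * np.eye(len(U))    # jitter for numerical stability
K_ux = rbf_kernel(U, X)

L_uu = cholesky(K_uu, lower=True)                  # K_uu = L_uu L_uu^T
W = solve_triangular(L_uu, K_ux, lower=True).T     # W = (L_uu^{-1} K_ux)^T
K_xx_approx = W @ W.T                              # approx to K_xu K_uu^{-1} K_ux
```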