Linearization (Taylor Expansions)¶
- Conditional Gaussian Distributions
- I: Additive Noise Model (x,f)
- II: Non-Additive Noise Model
- III: Quadratic Approximation
- Literature
- Supplementary
Analytical Moments¶
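With an uncertain input $x \sim \mathcal{N}(\mu_x, \Sigma_x)$, the quantity we want requires marginalizing the GP predictive distribution over the input:

$$ p(y \mid \mu_x, \Sigma_x) = \int p(y \mid x)\, \mathcal{N}(x \mid \mu_x, \Sigma_x)\, dx $$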
This distribution is non-Gaussian because we have to propagate a probability distribution through a non-linear kernel function, so this integral becomes intractable. We can, however, compute an analytical Gaussian approximation by computing only the mean and the variance of the resulting distribution.
Mean Function¶
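The first moment is the expectation of the predictive mean function $f$ under the input distribution:

$$ m(\mu_x, \Sigma_x) = \mathbb{E}_{x}\left[ f(x) \right] = \int f(x)\, \mathcal{N}(x \mid \mu_x, \Sigma_x)\, dx $$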
Variance Function¶
The variance term is a bit more complex.
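By the law of total variance, it is the expected predictive variance plus the variance of the predictive mean:

$$ v(\mu_x, \Sigma_x) = \mathbb{E}_{x}\left[ \nu^2(x) \right] + \mathbb{V}_{x}\left[ f(x) \right] $$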
Taylor Approximation¶
We will approximate our mean and variance function via a Taylor Expansion. First the mean function:
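$$ f(x) \approx f(\mu_x) + \nabla_x f(\mu_x)^\top (x - \mu_x) + \frac{1}{2} (x - \mu_x)^\top \nabla_x^2 f(\mu_x)\, (x - \mu_x) $$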
and then the variance function:
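$$ \nu^2(x) \approx \nu^2(\mu_x) + \nabla_x \nu^2(\mu_x)^\top (x - \mu_x) + \frac{1}{2} (x - \mu_x)^\top \nabla_x^2 \nu^2(\mu_x)\, (x - \mu_x) $$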
Linearized Predictive Mean and Variance¶
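Taking the expectation and variance of these expansions under $x \sim \mathcal{N}(\mu_x, \Sigma_x)$ gives the approximate predictive moments:

$$ \begin{aligned} \tilde{\mu}(\mu_x, \Sigma_x) &= f(\mu_x) + \frac{1}{2} \text{Tr}\left\{ \nabla_x^2 f(\mu_x)\, \Sigma_x \right\} \\ \tilde{\nu}^2(\mu_x, \Sigma_x) &= \nu^2(\mu_x) + \nabla_x f(\mu_x)\, \Sigma_x\, \nabla_x f(\mu_x)^\top + \frac{1}{2} \text{Tr}\left\{ \nabla_x^2 \nu^2(\mu_x)\, \Sigma_x \right\} \end{aligned} $$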
where $\nabla_x f(\mu_x)$ is the gradient of the function $f$ w.r.t. $x$ evaluated at $\mu_x$, and $\nabla_x^2 f(\mu_x)$ is its second derivative (the Hessian). This is a second-order approximation, which carries those expensive Hessian terms. Studies have shown that these terms tend to be negligible in practice, so a first-order approximation is typically enough.
Practically speaking, this leaves us with the following predictive mean and variance functions:
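$$ \begin{aligned} \tilde{\mu}(\mu_x) &= f(\mu_x) \\ \tilde{\nu}^2(\mu_x, \Sigma_x) &= \nu^2(\mu_x) + \nabla_x f(\mu_x)\, \Sigma_x\, \nabla_x f(\mu_x)^\top \end{aligned} $$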
As seen above, the only extra term we need to include is the derivative of the mean function that is present in the predictive variance term.
Conditional Gaussian Distributions¶
I: Additive Noise Model (x,f)¶
This is the additive noise model, where we assume the input $x$ and the output $y$ are jointly Gaussian:

$$ \begin{bmatrix} x \\ y \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} \mu_{x} \\ \mu_{y} \end{bmatrix}, \begin{bmatrix} \Sigma_x & C \\ C^\top & \Pi \end{bmatrix} \right) $$

where

$$ \begin{aligned} \mu_y &= f(\mu_x) \\ \Pi &= \nabla_x f(\mu_x)\, \Sigma_x\, \nabla_x f(\mu_x)^\top + \nu^2(x) \\ C &= \Sigma_x\, \nabla_x f(\mu_x)^\top \end{aligned} $$

So if we want to make predictions with our new model, the final equations are:

$$ \begin{aligned} f &\sim \mathcal{N}(f \mid \mu_{GP}, \nu^2_{GP}) \\ \mu_{GP} &= K_{*} K_{GP}^{-1} y = K_{*}\, \alpha \\ \nu^2_{GP} &= K_{**} - K_{*} K_{GP}^{-1} K_{*}^\top + \tilde{\Sigma}_x \end{aligned} $$

where $\tilde{\Sigma}_x = \nabla_x \mu_{GP}\, \Sigma_x\, \nabla_x \mu_{GP}^\top$.
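To make this concrete, below is a minimal JAX sketch of the first-order correction. It assumes an already-trained exact GP with an RBF kernel; `X_train`, `alpha` ($= K_{GP}^{-1}y$), and `nu2_gp` ($= \nu^2_{GP}(\mu_x)$) are placeholder names for quantities computed at training/prediction time.

```python
import jax
import jax.numpy as jnp

def rbf_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    # Squared-exponential kernel matrix between points of shape (M, D) and (N, D)
    sq_dists = jnp.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return variance * jnp.exp(-0.5 * sq_dists / lengthscale**2)

def gp_mean(x_star, X_train, alpha):
    # mu_GP(x*) = K_* alpha, with alpha = K_GP^{-1} y precomputed at training time
    k_star = rbf_kernel(x_star[None, :], X_train)  # shape (1, N)
    return (k_star @ alpha)[0]

def linearized_variance(mu_x, Sigma_x, X_train, alpha, nu2_gp):
    # nu2_GP + grad(mu_GP) Sigma_x grad(mu_GP)^T -- the Sigma_x-tilde correction
    grad_mu = jax.grad(gp_mean)(mu_x, X_train, alpha)  # shape (D,)
    return nu2_gp + grad_mu @ Sigma_x @ grad_mu

# Toy usage with random placeholder values (N=50 points, D=2 input dims)
key = jax.random.PRNGKey(0)
X_train = jax.random.normal(key, (50, 2))
alpha = jax.random.normal(key, (50,))   # stands in for K_GP^{-1} y
mu_x, Sigma_x = jnp.zeros(2), 0.1 * jnp.eye(2)
nu2_corrected = linearized_variance(mu_x, Sigma_x, X_train, alpha, nu2_gp=0.5)
```

Note that `jax.grad` gives us $\nabla_x \mu_{GP}$ essentially for free, so the correction costs little more than one extra gradient evaluation.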
Other GP Methods¶
We can extend this method to other GP algorithms, including sparse GP models. The only things that change are the original $\mu_{GP}$ and $\nu^2_{GP}$ equations. In a sparse GP we have the following predictive functions:

$$ \begin{aligned} \mu_{SGP} &= K_{*z} K_{zz}^{-1} m \\ \nu^2_{SGP} &= K_{**} - K_{*z} \left[ K_{zz}^{-1} - K_{zz}^{-1} S K_{zz}^{-1} \right] K_{*z}^{\top} \end{aligned} $$

So the new predictive functions will be:

$$ \begin{aligned} \mu_{SGP} &= K_{*z} K_{zz}^{-1} m \\ \nu^2_{SGP} &= K_{**} - K_{*z} \left[ K_{zz}^{-1} - K_{zz}^{-1} S K_{zz}^{-1} \right] K_{*z}^{\top} + \tilde{\Sigma}_x \end{aligned} $$

As shown above, this is a fairly extensible method that offers a cheap, improved predictive variance estimate for an already-trained GP model. Future work could evaluate how it carries over to other GP models, e.g. Sparse Spectrum GPs, Multi-Output GPs, etc.
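Under the same assumptions as the sketch above, the sparse case only swaps in the sparse predictive mean; here `Z` (the inducing points) and `Kzz_inv_m` ($= K_{zz}^{-1}m$) are again hypothetical precomputed quantities, and `rbf_kernel` is reused from the earlier sketch.

```python
def sgp_mean(x_star, Z, Kzz_inv_m):
    # mu_SGP(x*) = K_{*z} K_{zz}^{-1} m, with K_{zz}^{-1} m precomputed
    k_star_z = rbf_kernel(x_star[None, :], Z)  # shape (1, M)
    return (k_star_z @ Kzz_inv_m)[0]

def linearized_sgp_variance(mu_x, Sigma_x, Z, Kzz_inv_m, nu2_sgp):
    # nu2_SGP + grad(mu_SGP) Sigma_x grad(mu_SGP)^T, matching the equations above
    grad_mu = jax.grad(sgp_mean)(mu_x, Z, Kzz_inv_m)
    return nu2_sgp + grad_mu @ Sigma_x @ grad_mu
```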
II: Non-Additive Noise Model¶
III: Quadratic Approximation¶
Literature¶
- Gaussian Process Priors with Uncertain Inputs: Multiple-Step-Ahead Prediction - Girard et al. (2002) - Technical Report
Derives the expectation and variance of the Taylor-series expansion of the predictive mean and variance.
- Expectation Propagation in Gaussian Process Dynamical Systems: Extended Version - Deisenroth & Mohamed (2012) - NeurIPS
The first time the moment-matching and linearized versions appear in the GP literature.
- Learning with Uncertainty: Gaussian Processes and Relevance Vector Machines - Quiñonero-Candela (2004) - Thesis
Full law of iterated expectations and conditional variance.
- Gaussian Process Training with Input Noise - McHutchon & Rasmussen (2011) - NeurIPS
Used the same logic, but instead of only approximating the posterior, they also applied it to the model, which resulted in an iterative procedure.
- Multi-class Gaussian Process Classification with Noisy Inputs - Villacampa-Calvo et al. (2020) - arXiv
Applied the first-order approximation using the Taylor expansion to a classification problem and compared it to variational inference.
Supplementary¶
Error Propagation¶
To see more about error propagation and its relation to the mean and variance, see here.
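In the scalar case this reduces to the classic first-order error-propagation rule: for $y = f(x)$ with input uncertainty $\sigma_x^2$,

$$ \sigma_y^2 \approx \left( \frac{\partial f}{\partial x} \right)^2 \sigma_x^2 $$

which is exactly the one-dimensional version of the $\nabla_x f(\mu_x)\, \Sigma_x\, \nabla_x f(\mu_x)^\top$ term used throughout this section.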