Gaussian Processes — regression & classification

Gaussian processes (GPs) give us a nonparametric prior over functions and an analytically tractable posterior in the regression case — a small, well-conditioned linear-algebra problem at the heart of “Bayesian deep learning for the reasonable person.” This section walks through the two canonical GP inference modes — conjugate regression and non-conjugate latent-variable classification — using pyrox’s gp module. For the textbook treatment see Rasmussen & Williams (2006); for a broader probabilistic-modeling context see Murphy (2012).

Model

Let $x \in \mathbb{R}^{D}$ be an input and $f : \mathbb{R}^{D} \to \mathbb{R}$ the latent function of interest. A Gaussian process prior

$$f \sim \mathcal{GP}\bigl(m(\cdot),\, k_\theta(\cdot,\cdot)\bigr)$$

is fully specified by a mean function $m : \mathbb{R}^D \to \mathbb{R}$ (typically zero) and a positive-definite kernel $k_\theta : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ parameterized by hyperparameters $\theta$ (length-scales, amplitude, kernel-specific extras). Evaluated at any finite collection $X = \{x_i\}_{i=1}^N$, the vector $\mathbf{f} = f(X)$ is jointly Gaussian: $\mathbf{f} \sim \mathcal{N}(\mathbf{m}_X,\, K_{XX})$ with $[K_{XX}]_{ij} = k_\theta(x_i, x_j)$.
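To make the finite-dimensional view concrete, here is a minimal prior-sampling sketch in plain JAX. The `rbf_kernel` helper, its hyperparameters, and the jitter constant are illustrative assumptions rather than pyrox's API.

```python
import jax
import jax.numpy as jnp

def rbf_kernel(x1, x2, lengthscale=1.0, amplitude=1.0):
    # Squared-exponential kernel: k(x, x') = a^2 exp(-||x - x'||^2 / (2 l^2)).
    sq_dists = jnp.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return amplitude**2 * jnp.exp(-0.5 * sq_dists / lengthscale**2)

key = jax.random.PRNGKey(0)
X = jnp.linspace(-3.0, 3.0, 50)[:, None]      # N x D inputs, here D = 1
K = rbf_kernel(X, X) + 1e-6 * jnp.eye(50)     # small jitter for stability
L = jnp.linalg.cholesky(K)
f = L @ jax.random.normal(key, (50,))         # one draw of f ~ N(0, K_XX)
```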

Conjugate regression

For Gaussian noise $y_i = f(x_i) + \varepsilon_i$ with $\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_n^2)$, the posterior predictive at new inputs $X_\star$ is closed-form:

$$
\begin{aligned}
\mu_\star &= \mathbf{m}_\star + K_{\star X}\,(K_{XX} + \sigma_n^2 I)^{-1}(\mathbf{y} - \mathbf{m}_X),\\
\Sigma_\star &= K_{\star\star} - K_{\star X}\,(K_{XX} + \sigma_n^2 I)^{-1}\,K_{X\star}.
\end{aligned}
$$
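These equations are a few lines of linear algebra. The sketch below is a hand-rolled version under the assumption of a zero mean function; it is not pyrox's implementation, though any GP library computes something equivalent internally.

```python
import jax.numpy as jnp
from jax.scipy.linalg import cho_factor, cho_solve

def gp_predict(X, y, X_star, kernel, noise_var):
    # Posterior predictive under a zero mean function.
    Kxx = kernel(X, X) + noise_var * jnp.eye(X.shape[0])   # K_XX + sigma_n^2 I
    Ksx = kernel(X_star, X)                                # K_{star X}
    Kss = kernel(X_star, X_star)                           # K_{star star}
    cho = cho_factor(Kxx, lower=True)                      # factor once, reuse
    mu = Ksx @ cho_solve(cho, y)                           # predictive mean
    cov = Kss - Ksx @ cho_solve(cho, Ksx.T)                # predictive covariance
    return mu, cov
```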

Hyperparameters are learned by maximizing the log marginal likelihood

$$
\log p(\mathbf{y} \mid \theta, \sigma_n) = -\tfrac{1}{2}(\mathbf{y} - \mathbf{m}_X)^\top (K_{XX} + \sigma_n^2 I)^{-1}(\mathbf{y} - \mathbf{m}_X) - \tfrac{1}{2}\log\bigl\lvert K_{XX} + \sigma_n^2 I\bigr\rvert - \tfrac{N}{2}\log 2\pi.
$$
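Computed naively, the quadratic form and the log-determinant both invite numerical trouble; a single Cholesky factor gives both stably, and `jax.grad` then supplies hyperparameter gradients. The sketch reuses the illustrative `rbf_kernel` from above, and the log-space parameterization of $\theta$ is an assumption made for unconstrained optimization.

```python
import jax
import jax.numpy as jnp
from jax.scipy.linalg import cho_solve

def log_marginal_likelihood(params, X, y):
    # params = (log lengthscale, log amplitude, log noise variance);
    # exponentiating enforces positivity (an illustrative choice).
    lengthscale, amplitude, noise_var = jnp.exp(params)
    K = rbf_kernel(X, X, lengthscale, amplitude) + noise_var * jnp.eye(len(y))
    L = jnp.linalg.cholesky(K)
    alpha = cho_solve((L, True), y)                   # (K + sigma_n^2 I)^{-1} y
    logdet = 2.0 * jnp.sum(jnp.log(jnp.diag(L)))      # log|K + sigma_n^2 I|
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * len(y) * jnp.log(2.0 * jnp.pi)

# With training data X, y in scope, any gradient-based optimizer applies:
grad_fn = jax.grad(lambda p: -log_marginal_likelihood(p, X, y))
```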

Non-conjugate classification

For binary outputs $y_i \in \{0, 1\}$ with Bernoulli likelihood $y_i \mid f(x_i) \sim \text{Bernoulli}\bigl(\sigma(f(x_i))\bigr)$, where $\sigma$ is the logistic link, the posterior $p(\mathbf{f} \mid \mathbf{y})$ is no longer Gaussian. Classical options are the Laplace approximation of MacKay (1992), which centers a Gaussian at the MAP with precision equal to the Hessian of the negative log posterior, and stochastic variational inference (Hensman et al., 2013). In pyrox the latent-GP model is the same equinox module as in the regression case; only the likelihood changes.
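For orientation, below is a compact sketch of the Laplace mode-finding step, following the numerically stable Newton iteration in Rasmussen & Williams (2006, Algorithm 3.1). It is a textbook reconstruction, not pyrox's code path.

```python
import jax
import jax.numpy as jnp
from jax.scipy.linalg import cho_solve

def laplace_mode(K, y, num_steps=20):
    # y in {0, 1}; returns the MAP latent vector f_hat.
    f = jnp.zeros_like(y, dtype=jnp.float32)
    for _ in range(num_steps):
        pi = jax.nn.sigmoid(f)
        W = pi * (1.0 - pi)                   # -Hessian of Bernoulli log lik
        sqrt_W = jnp.sqrt(W)
        B = jnp.eye(len(y)) + sqrt_W[:, None] * K * sqrt_W[None, :]
        L = jnp.linalg.cholesky(B)            # well-conditioned by construction
        b = W * f + (y - pi)                  # gradient pieces of the Newton step
        a = b - sqrt_W * cho_solve((L, True), sqrt_W * (K @ b))
        f = K @ a                             # updated mode estimate
    return f
```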

Numerical considerations
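As general guidance (standard GP practice rather than anything pyrox-specific): never form $(K_{XX} + \sigma_n^2 I)^{-1}$ explicitly. Factor once with a Cholesky decomposition, reuse the factor for solves and the log-determinant, and add a small diagonal jitter when the kernel matrix is near-singular. A minimal sketch of the common jitter-escalation idiom, assuming eager (non-jitted) execution:

```python
import jax.numpy as jnp

def safe_cholesky(K, jitters=(1e-8, 1e-6, 1e-4)):
    # JAX signals a failed factorization with NaNs rather than an exception,
    # so escalate the jitter until the factor is finite. Not jit-compatible
    # as written, since the NaN check is a Python-level branch.
    for jitter in jitters:
        L = jnp.linalg.cholesky(K + jitter * jnp.eye(K.shape[0]))
        if jnp.all(jnp.isfinite(L)):
            return L
    raise ValueError("Cholesky failed; kernel matrix is badly conditioned.")
```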

Notebooks

References
  1. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. http://www.gaussianprocess.org/gpml/
  2. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
  3. MacKay, D. J. C. (1992). The Evidence Framework Applied to Classification Networks. Neural Computation, 4(5), 720–736. https://doi.org/10.1162/neco.1992.4.5.720
  4. Hensman, J., Fusi, N., & Lawrence, N. D. (2013). Gaussian Processes for Big Data. Uncertainty in Artificial Intelligence (UAI).
  5. Titsias, M. K. (2009). Variational Learning of Inducing Variables in Sparse Gaussian Processes. International Conference on Artificial Intelligence and Statistics (AISTATS).