Gaussian Processes — regression & classification
Gaussian processes (GPs) give us a nonparametric prior over functions and an analytically tractable posterior in the regression case — a small, well-conditioned linear-algebra problem at the heart of “Bayesian deep learning for the reasonable person.” This section walks through the two canonical GP inference modes — conjugate regression and non-conjugate latent-variable classification — using pyrox’s gp module. For the textbook treatment see Rasmussen & Williams (2006); for a broader probabilistic-modeling context see Murphy (2012).
Model
Let $x \in \mathcal{X}$ be an input and $f : \mathcal{X} \to \mathbb{R}$ the latent function of interest. A Gaussian process prior

$$f \sim \mathcal{GP}\bigl(m(\cdot),\, k_\theta(\cdot, \cdot)\bigr)$$

is fully specified by a mean function $m$ (typically zero) and a positive-definite kernel $k_\theta$ parameterized by hyperparameters $\theta$ (length-scales, amplitude, kernel-specific extras). Evaluated at any finite collection of inputs $X = \{x_1, \dots, x_n\}$, the vector $\mathbf{f} = \bigl(f(x_1), \dots, f(x_n)\bigr)^\top$ is jointly Gaussian, $\mathbf{f} \sim \mathcal{N}\bigl(m(X), K_{XX}\bigr)$, where $[K_{XX}]_{ij} = k_\theta(x_i, x_j)$.
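To make the finite-dimensional view concrete, here is a minimal NumPy sketch (not pyrox's gp module; the RBF kernel and hyperparameter values are assumptions for illustration) that builds a kernel matrix and draws one sample from the prior:

```python
# Minimal NumPy sketch of the finite-dimensional view of a GP prior:
# an RBF kernel k_theta and one draw of f = (f(x_1), ..., f(x_n)) ~ N(0, K_XX).
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, amplitude=1.0):
    """k_theta(x, x') = amplitude^2 * exp(-0.5 * (x - x')^2 / lengthscale^2) for 1D inputs."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * sqdist / lengthscale**2)

rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 50)                           # any finite collection of inputs
K_XX = rbf_kernel(X, X)                                  # [K_XX]_ij = k_theta(x_i, x_j)
L = np.linalg.cholesky(K_XX + 1e-6 * np.eye(len(X)))     # small jitter for numerical stability
f_prior = L @ rng.standard_normal(len(X))                # one sample from N(0, K_XX)
```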
Conjugate regression
For Gaussian noise, $y_i = f(x_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$, the posterior predictive at new inputs $X_*$ is closed-form:

$$
\begin{aligned}
\mathbf{f}_* \mid X, \mathbf{y}, X_* &\sim \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*),\\
\boldsymbol{\mu}_* &= K_{*X}\,(K_{XX} + \sigma_n^2 I)^{-1}\,\mathbf{y},\\
\boldsymbol{\Sigma}_* &= K_{**} - K_{*X}\,(K_{XX} + \sigma_n^2 I)^{-1}\,K_{X*}.
\end{aligned}
$$
Hyperparameters $\theta$ (and the noise variance $\sigma_n^2$) are learned by maximizing the log marginal likelihood

$$\log p(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2}\,\mathbf{y}^\top (K_{XX} + \sigma_n^2 I)^{-1}\,\mathbf{y} \;-\; \tfrac{1}{2}\log\bigl\lvert K_{XX} + \sigma_n^2 I \bigr\rvert \;-\; \tfrac{n}{2}\log 2\pi.$$
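The two expressions above reduce to a handful of Cholesky-based linear solves. The following is a hedged NumPy/SciPy sketch of that recipe (the standard one from Rasmussen & Williams, Algorithm 2.1), not pyrox's implementation; the function name and argument layout are assumptions:

```python
import numpy as np
from scipy.linalg import cho_solve, solve_triangular

def gp_regression(K_XX, K_Xs, K_ss, y, noise_var):
    """Closed-form GP regression: predictive mean/covariance and log marginal likelihood.

    K_XX: (n, n) train/train kernel, K_Xs: (n, m) train/test kernel,
    K_ss: (m, m) test/test kernel, y: (n,) targets, noise_var: sigma_n^2.
    """
    n = K_XX.shape[0]
    L = np.linalg.cholesky(K_XX + noise_var * np.eye(n))  # factor once: O(n^3)
    alpha = cho_solve((L, True), y)                        # (K_XX + sigma_n^2 I)^{-1} y
    mu_star = K_Xs.T @ alpha                               # predictive mean
    V = solve_triangular(L, K_Xs, lower=True)              # V = L^{-1} K_Xs
    Sigma_star = K_ss - V.T @ V                            # predictive covariance
    log_marg_lik = (-0.5 * y @ alpha
                    - np.sum(np.log(np.diag(L)))           # = -0.5 * log|K_XX + sigma_n^2 I|
                    - 0.5 * n * np.log(2.0 * np.pi))
    return mu_star, Sigma_star, log_marg_lik
```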
Non-conjugate classification
For binary outputs $y_i \in \{0, 1\}$ with Bernoulli likelihood $p(y_i = 1 \mid f(x_i)) = \sigma(f(x_i))$ (where $\sigma$ is the logistic link), the posterior is no longer Gaussian. Classical options are a Laplace approximation (MacKay, 1992) — a Gaussian approximation centered at the MAP with precision equal to the Hessian of the negative log posterior — or stochastic variational inference (Hensman et al., 2013). In pyrox the latent-GP model is the same equinox module as in the regression case; only the likelihood changes.
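As a point of reference, here is a hedged NumPyro sketch of the latent-GP classification model: the same Gaussian prior over $\mathbf{f}$, with only the likelihood swapped for a Bernoulli. The `rbf_kernel` helper, the hyperparameter priors, and the toy data are assumptions; the pyrox notebook wires these up differently.

```python
import jax.numpy as jnp
import jax.random as jr
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def rbf_kernel(X, lengthscale, amplitude, jitter=1e-6):
    sqdist = (X[:, None] - X[None, :]) ** 2
    K = amplitude**2 * jnp.exp(-0.5 * sqdist / lengthscale**2)
    return K + jitter * jnp.eye(X.shape[0])

def latent_gp_bernoulli(X, y=None):
    lengthscale = numpyro.sample("lengthscale", dist.LogNormal(0.0, 1.0))
    amplitude = numpyro.sample("amplitude", dist.LogNormal(0.0, 1.0))
    K = rbf_kernel(X, lengthscale, amplitude)
    # Latent function values get the GP prior; the likelihood is Bernoulli(logits=f).
    f = numpyro.sample("f", dist.MultivariateNormal(loc=jnp.zeros(X.shape[0]),
                                                    covariance_matrix=K))
    numpyro.sample("y", dist.Bernoulli(logits=f), obs=y)

# Tiny synthetic usage example.
X = jnp.linspace(-2.0, 2.0, 30)
y = (X > 0).astype(jnp.int32)
mcmc = MCMC(NUTS(latent_gp_bernoulli), num_warmup=500, num_samples=500)
mcmc.run(jr.PRNGKey(0), X, y=y)
```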
Numerical considerations
- Cholesky factorization is the workhorse. Factor $K_{XX} + \sigma_n^2 I = L L^\top$ once, then solve with $L$ via two triangular systems; this is $\mathcal{O}(n^3)$ time / $\mathcal{O}(n^2)$ memory and dominates both training and prediction.
- Jitter / nugget. $K_{XX}$ is positive definite in exact arithmetic but numerically loses rank as the number of (near-duplicate) inputs grows. Adding a small jitter $\varepsilon I$ (commonly around $10^{-6}$) is standard. Pyrox threads this through its kernel call.
- Parameterization of positives. Length-scales, amplitudes, and noise variances are constrained to $(0, \infty)$. Train in the unconstrained space — typically via a $\log$ or softplus transform — so gradients stay finite near zero. The masterclass notebooks show three ways pyrox wires this up; a generic sketch of the pattern follows this list.
- Inducing points. For large $n$, exact GPs are impractical. Sparse variational GPs (Titsias, 2009; Hensman et al., 2013) pick $m \ll n$ inducing locations and give an $\mathcal{O}(nm^2)$ algorithm; this isn't covered in the tutorials below but is worth knowing when they feel slow.
- Prediction variance conditioning. Computing the predictive covariance $\Sigma_* = K_{**} - K_{*X}(K_{XX} + \sigma_n^2 I)^{-1} K_{X*}$ is numerically safer when done as $K_{**} - V^\top V$ with $V = L^{-1} K_{X*}$ (a triangular solve against the Cholesky factor), rather than inverting the kernel matrix explicitly.
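The sketch below (plain JAX, not pyrox's wiring; the RBF kernel, exp transform, and learning rate are assumptions) illustrates the unconstrained-parameterization and jitter points above: raw parameters live in $\mathbb{R}$, are mapped through $\exp$ to $(0, \infty)$, and the negative log marginal likelihood is differentiated with `jax.grad`.

```python
import jax
import jax.numpy as jnp
from jax.scipy.linalg import cho_solve

def neg_log_marginal_likelihood(raw_params, X, y, jitter=1e-6):
    # Unconstrained raw parameters in R are mapped to (0, inf) via exp.
    lengthscale = jnp.exp(raw_params[0])
    signal_var = jnp.exp(raw_params[1])
    noise_var = jnp.exp(raw_params[2])
    sqdist = (X[:, None] - X[None, :]) ** 2
    K = signal_var * jnp.exp(-0.5 * sqdist / lengthscale**2)   # RBF kernel
    A = K + (noise_var + jitter) * jnp.eye(X.shape[0])         # jitter keeps the Cholesky stable
    L = jnp.linalg.cholesky(A)
    alpha = cho_solve((L, True), y)
    return (0.5 * y @ alpha
            + jnp.sum(jnp.log(jnp.diag(L)))                    # 0.5 * log|A|
            + 0.5 * X.shape[0] * jnp.log(2.0 * jnp.pi))

# Plain gradient descent on the unconstrained parameters, toy 1D data.
X = jnp.linspace(-3.0, 3.0, 40)
y = jnp.sin(X) + 0.1 * jax.random.normal(jax.random.PRNGKey(0), X.shape)
raw = jnp.zeros(3)                    # log length-scale, log signal variance, log noise variance
grad_fn = jax.jit(jax.grad(neg_log_marginal_likelihood))
for _ in range(200):
    raw = raw - 1e-2 * grad_fn(raw, X, y)
lengthscale, signal_var, noise_var = jnp.exp(raw)
```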
Notebooks
- `exact_gp_regression` — conjugate regression end-to-end: prior / marginal likelihood / posterior predictive on a 1D synthetic dataset, with three patterns for wiring hyperparameters (see the masterclass sub-section).
- `latent_gp_classification` — latent-GP + Bernoulli likelihood via NumPyro's `NUTS` sampler, with the same three patterns for parameter handling.
References
- Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. http://www.gaussianprocess.org/gpml/
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
- MacKay, D. J. C. (1992). The Evidence Framework Applied to Classification Networks. Neural Computation, 4(5), 720–736. https://doi.org/10.1162/neco.1992.4.5.720
- Hensman, J., Fusi, N., & Lawrence, N. D. (2013). Gaussian Processes for Big Data. Uncertainty in Artificial Intelligence (UAI).
- Titsias, M. K. (2009). Variational Learning of Inducing Variables in Sparse Gaussian Processes. International Conference on Artificial Intelligence and Statistics (AISTATS).