Inference¶
Variational Inference¶
This section outlines a few interesting papers that try to improve how we do variational inference. I try to stick to methods that people have successfully applied to GPs. Below are a few key SOTA objective functions that you may come across in the GP literature. The most common is definitely the Variational ELBO, but there are a few lesser-known objective functions that came out recently which I think could be useful in the future; we just need to get them implemented and tested. Along the way there have also been other modifications, e.g. to the gradient descent regime.
Variational Evidence Lower Bound (ELBO)¶
This is the standard objective function that you will find in the literature.
Scalable Variational Gaussian Process Classification - Hensman et al. (2015)
Details
$$\mathcal{L}_{\mathrm{ELBO}} = \sum_{i=1}^{N} \mathbb{E}_{q(f_i)}\left[ \log p(y_i \mid f_i) \right] - \beta \, D_{KL}\left[ q(\mathbf{u}) \,\|\, p(\mathbf{u}) \right]$$
where:
- N - number of data points
- q(f_i) - marginal variational distribution at data point i, \int p(f_i \mid \mathbf{u}) \, q(\mathbf{u}) \, d\mathbf{u}
- p(\mathbf{u}) - prior distribution for the inducing function values
- q(\mathbf{u}) - variational distribution for the inducing function values
- \beta - free parameter for the D_{KL} regularization penalization
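For concreteness, here is a minimal sketch of training a sparse variational GP against this objective using GPyTorch's `VariationalELBO`. The toy data, inducing-point count, and Gaussian likelihood are illustrative choices, not taken from the paper (which focuses on classification):

```python
import torch
import gpytorch

# Illustrative toy data (not from the paper)
train_x = torch.linspace(0, 1, 100).unsqueeze(-1)                              # (100, 1)
train_y = torch.sin(2 * torch.pi * train_x).squeeze(-1) + 0.1 * torch.randn(100)


class SVGP(gpytorch.models.ApproximateGP):
    """Sparse variational GP with a learned q(u) over M inducing points."""

    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )


model = SVGP(inducing_points=train_x[:20].clone())
likelihood = gpytorch.likelihoods.GaussianLikelihood()

# beta < 1 down-weights the D_KL[q(u) || p(u)] penalty in the objective above
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.numel(), beta=1.0)

optimizer = torch.optim.Adam(
    list(model.parameters()) + list(likelihood.parameters()), lr=0.01
)
for _ in range(200):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)   # negative ELBO
    loss.backward()
    optimizer.step()
```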
Scalable Training of Inference Networks for Gaussian-Process Models - Shi et al. (2019)
-> Paper
-> Code
Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process Models - Galy-Fajou et al. (2020)
-> Paper
-> Code
-> Slides
Some excellent Slides
Monte Carlo¶
Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process Models - Galy-Fajou et al. (2020)
-> Paper
-> Code
-> Tweet
Importance Weighted Variational Inference (IWVI)¶
Deep Gaussian Processes with Importance-Weighted Variational Inference - Salimbeni et al. (2019) - Paper | Code | Video | Poster | ICML 2019 Slides | Workshop Slides
They combine importance sampling with variational inference for single-layer and multi-layer (deep) GPs, and show results that match or improve on standard variational inference; the general form of the bound is sketched after the reference below.
- Importance Weighting and Variational Inference - Domke & Sheldon (2018)
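As a rough sketch (notation simplified from Domke & Sheldon), the K-sample importance-weighted bound replaces the single-sample ELBO with

$$\mathcal{L}_{K} = \mathbb{E}_{f_1, \dots, f_K \sim q}\left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p(y, f_k)}{q(f_k)} \right],$$

which satisfies \mathcal{L}_{1} = \mathrm{ELBO} \leq \mathcal{L}_{K} \leq \log p(y), so increasing K tightens the bound at the cost of more samples per gradient step.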
Hybrid Schemes¶
Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process Models - Galy-Fajou et al. (2020)
Combines stochastic variational inference (SVI) with Gibbs sampling.
-> Paper
-> Code
-> Package
Predictive Log Likelihood (PLL)¶
Sparse Gaussian Process Regression Beyond Variational Inference - Jankowiak et al. (2019)
Details
$$\mathcal{L}_{\mathrm{PLL}} = \sum_{i=1}^{N} \log \mathbb{E}_{q(f_i)}\left[ p(y_i \mid f_i) \right] - D_{KL}\left[ q(\mathbf{u}) \,\|\, p(\mathbf{u}) \right]$$
where:
- N - number of data points
- q(f_i) - marginal variational distribution at data point i
- p(\mathbf{u}) - prior distribution for the inducing function values
- q(\mathbf{u}) - variational distribution for the inducing function values
Compared to the ELBO, the log sits outside the expectation over q(f_i), so the objective scores the predictive distribution directly.
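Assuming the same SVGP `model`, `likelihood`, and data as in the ELBO sketch above, GPyTorch exposes this objective as `PredictiveLogLikelihood`, so switching is a one-line change:

```python
import gpytorch

# Drop-in replacement for the VariationalELBO objective in the sketch above
mll = gpytorch.mlls.PredictiveLogLikelihood(likelihood, model, num_data=train_y.numel())
loss = -mll(model(train_x), train_y)   # minimize the negative objective as before
```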
Generalized Variational Inference (GVI)¶
Generalized Variational Inference - Knoblauch et al. (2019)
A generalized Bayesian inference framework. Instead of the usual KL (Shannon) divergence penalty, it allows alternative divergences such as Rényi's family of information-theoretic divergences, as well as more general loss functions and variational families; the general form of the objective is sketched below. They had success applying it to Bayesian neural networks and deep Gaussian processes.
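Roughly (paraphrasing the paper's "rule of three" and glossing over technical conditions), GVI poses posterior inference as the optimization problem

$$q^{*} = \underset{q \in \Pi}{\arg\min} \; \mathbb{E}_{q(\theta)}\left[ \sum_{i=1}^{N} \ell(\theta, x_i) \right] + D\left( q \,\|\, \pi \right),$$

where standard VI corresponds to choosing the negative log-likelihood for \ell, the KL divergence for D, and the usual variational family for \Pi; robustness comes from swapping in alternative losses or divergences such as Rényi's \alpha-divergence.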
Gradient Descent Regimes¶
Natural Gradients (NGs)¶
Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models - Salimbeni et al. (2018) | Code
This paper argues that training sparse GP models with ordinary gradient descent can be quite slow because the variational parameters of q_\phi(\mathbf{u}) have to be optimized alongside the model hyperparameters. They propose using the natural gradient for the variational parameters and standard gradient methods for the remaining parameters, and show that SVGP and deep GP models converge much faster under this training regime (a two-optimizer sketch follows the links below). I imagine this would also be super useful for the BayesianGPLVM, where we have variational parameters for the inputs as well.
-> Blog - Agustinus Kristiadi
-> Lecture 3.5 Natural Gradient Optimization (I) | Neural Networks | MLCV 2017
- Noisy Natural Gradient as Variational Inference - Zhang (2018) - Code
- PyTorch Implementation
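Here is a minimal sketch of this two-optimizer scheme in GPyTorch, assuming an SVGP model like the one in the ELBO sketch above but built with `gpytorch.variational.NaturalVariationalDistribution` (required for the `NGD` optimizer), plus the same `likelihood`, `train_x`, and `train_y`:

```python
import torch
import gpytorch

# Natural-gradient steps on the variational parameters of q(u) ...
variational_optimizer = gpytorch.optim.NGD(
    model.variational_parameters(), num_data=train_y.numel(), lr=0.1
)
# ... and ordinary Adam steps on the kernel/likelihood hyperparameters
hyperparameter_optimizer = torch.optim.Adam(
    list(model.hyperparameters()) + list(likelihood.parameters()), lr=0.01
)

mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.numel())
for _ in range(100):
    variational_optimizer.zero_grad()
    hyperparameter_optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    variational_optimizer.step()       # natural gradient for q(u)
    hyperparameter_optimizer.step()    # standard gradient for everything else
```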
Parallel training of DNNs with Natural Gradient and Parameter Averaging - Povey et al. (2014)
A seemingly drop-in replacement for stochastic gradient descent, shown to improve generalization, stabilize training, and help obtain high-quality uncertainty estimates.
-> Paper
-> Code
-> Blog
Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm - Liu & Wang (2016)
A tractable approach for approximating high-dimensional probability distributions using functional gradient descent in an RKHS. It comes from a connection between the derivative of the KL divergence and Stein's identity; a toy sketch of the particle update follows the link below.
-> Stein's Method Webpage
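As a toy sketch of the SVGD particle update (fixed kernel bandwidth and step size for simplicity; the paper uses a median-heuristic bandwidth and AdaGrad):

```python
import numpy as np

def rbf_kernel(x, h=1.0):
    """Pairwise RBF kernel k(x_j, x_i) and its gradient w.r.t. x_j."""
    diff = x[:, None, :] - x[None, :, :]             # diff[j, i] = x_j - x_i
    k = np.exp(-(diff ** 2).sum(-1) / (2 * h ** 2))  # (n, n)
    grad_k = -diff * k[..., None] / h ** 2           # grad_{x_j} k(x_j, x_i)
    return k, grad_k

def svgd_step(x, score, eps=0.1, h=1.0):
    """One SVGD update (Liu & Wang, 2016) for particles x given the target's score function."""
    n = x.shape[0]
    k, grad_k = rbf_kernel(x, h)
    # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ]
    drift = k.T @ score(x)       # pulls particles toward high-density regions
    repulsion = grad_k.sum(0)    # pushes particles apart (keeps diversity)
    return x + eps * (drift + repulsion) / n

# Toy target: standard 2-D Gaussian, whose score is simply -x
particles = np.random.randn(50, 2) * 3 + 5.0
for _ in range(500):
    particles = svgd_step(particles, score=lambda x: -x)
print(particles.mean(0), particles.std(0))  # means ~ 0, spread ~ 1
```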
Robust, Accurate Stochastic Optimization for Variational Inference - Dhaka et al. (09-2020)
A more robust stochastic variational inference procedure that combines Polyak-Ruppert iterate averaging with MCMC-style convergence diagnostics on the optimization trace (a toy sketch of the averaging follows the links below).
-> Paper
-> Tweet
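For intuition, the averaging part amounts to replacing the last SGD iterate with the mean of the tail iterates once they look stationary. A toy sketch on a quadratic with noisy gradients; the paper's contribution also includes the diagnostics that decide when to start averaging, which is hard-coded here:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(x):
    """Noisy gradient of f(x) = 0.5 * ||x||^2 (stand-in for a stochastic ELBO gradient)."""
    return x + rng.normal(scale=1.0, size=x.shape)

x, lr, tail = np.full(2, 5.0), 0.05, []
for t in range(2000):
    x = x - lr * noisy_grad(x)
    if t >= 1000:              # crude stand-in for a stationarity diagnostic
        tail.append(x.copy())

x_avg = np.mean(tail, axis=0)  # Polyak-Ruppert average of the tail iterates
print(x, x_avg)                # the average has much lower variance than the last iterate
```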
Uncategorized¶
Regularization¶
- Regularized Sparse Gaussian Processes - Meng & Lee (2019) [arxiv]
Imposes a regularization coefficient on the KL term in the sparse GP objective. This addresses the issue where the distribution of the inducing inputs fails to capture the distribution of the training inputs.