Inference¶
Variational Inference¶
This section outlines a few interesting papers that try to improve how we do variational inference. I try to stick to methods that people have successfully applied to GPs. Below are a few key SOTA objective functions that you may come across in the GP literature. The most common is definitely the Variational ELBO, but there are a few lesser-known objective functions that came out recently which I think could be useful in the future; we just need to get them implemented and tested. Along the way there have also been other modifications, e.g. to the gradient descent regime.
Variational Evidence Lower Bound (ELBO)¶
This is the standard objective function that you will find in the literature.
Scalable Variational Gaussian Process Classification - Hensman et al. (2015)
Details
$$\mathcal{L}_{\mathrm{ELBO}} = \sum_{i=1}^{N} \mathbb{E}_{q(f_i)}\left[ \log p(y_i \mid f_i) \right] - \beta \, D_{KL}\left[ q(\mathbf{u}) \,\|\, p(\mathbf{u}) \right]$$
where:
- N - number of data points
- q(f_i) - marginal variational distribution at data point i, \int p(f_i \mid \mathbf{u}) \, q(\mathbf{u}) \, d\mathbf{u}
- p(\mathbf{u}) - prior distribution for the inducing function values
- q(\mathbf{u}) - variational distribution for the inducing function values
- \beta - free parameter for the D_{KL} regularization penalization
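For concreteness, here is a minimal sketch of training a sparse variational GP against this objective using GPyTorch's `VariationalELBO`. The toy data, inducing-point count, and Gaussian likelihood are illustrative choices, not taken from the paper (which focuses on classification):

```python
import torch
import gpytorch

# Illustrative toy data (not from the paper)
train_x = torch.linspace(0, 1, 100).unsqueeze(-1)                              # (100, 1)
train_y = torch.sin(2 * torch.pi * train_x).squeeze(-1) + 0.1 * torch.randn(100)


class SVGP(gpytorch.models.ApproximateGP):
    """Sparse variational GP with a learned q(u) over M inducing points."""

    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )


model = SVGP(inducing_points=train_x[:20].clone())
likelihood = gpytorch.likelihoods.GaussianLikelihood()

# beta < 1 down-weights the D_KL[q(u) || p(u)] penalty in the objective above
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.numel(), beta=1.0)

optimizer = torch.optim.Adam(
    list(model.parameters()) + list(likelihood.parameters()), lr=0.01
)
for _ in range(200):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)   # negative ELBO
    loss.backward()
    optimizer.step()
```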
Scalable Training of Inference Networks for Gaussian-Process Models - Shi et al. (2019)
-> Paper
-> Code
Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process Models - Galy-Fajou et al. (2020)
-> Paper
-> Code
-> Slides
Some excellent Slides
Monte Carlo¶
Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process Models - Galy-Fajou et al. (2020)
-> Paper
-> Code
-> Tweet
Importance Weighted Variational Inference (IWVI)¶
Deep Gaussian Processes with Importance-Weighted Variational Inference - Salimbeni et al. (2019) - Paper | Code | Video | Poster | ICML 2019 Slides | Workshop Slides
They combine importance sampling with variational inference for single-layer and multi-layer (deep) GPs, and show results that match or improve on standard variational inference; the general form of the bound is sketched after the reference below.
- Importance Weighting and Variational Inference - Domke & Sheldon (2018)
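As a rough sketch (notation simplified from Domke & Sheldon), the K-sample importance-weighted bound replaces the single-sample ELBO with

$$\mathcal{L}_{K} = \mathbb{E}_{f_1, \dots, f_K \sim q}\left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p(y, f_k)}{q(f_k)} \right],$$

which satisfies \mathcal{L}_{1} = \mathrm{ELBO} \leq \mathcal{L}_{K} \leq \log p(y), so increasing K tightens the bound at the cost of more samples per gradient step.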
Hybrid Schemes¶
Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process Models - Galy-Fajou et al. (2020)
Combines stochastic variational inference (SVI) with Gibbs sampling.
-> Paper
-> Code
-> Package
Predictive Log Likelihood (PLL)¶
Sparse Gaussian Process Regression Beyond Variational Inference - Jankowiak et al. (2019)
Details
$$\mathcal{L}_{\mathrm{PLL}} = \sum_{i=1}^{N} \log \mathbb{E}_{q(f_i)}\left[ p(y_i \mid f_i) \right] - D_{KL}\left[ q(\mathbf{u}) \,\|\, p(\mathbf{u}) \right]$$
where:
- N - number of data points
- q(f_i) - marginal variational distribution at data point i
- p(\mathbf{u}) - prior distribution for the inducing function values
- q(\mathbf{u}) - variational distribution for the inducing function values
Compared to the ELBO, the log sits outside the expectation over q(f_i), so the objective scores the predictive distribution directly.
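Assuming the same SVGP `model`, `likelihood`, and data as in the ELBO sketch above, GPyTorch exposes this objective as `PredictiveLogLikelihood`, so switching is a one-line change:

```python
import gpytorch

# Drop-in replacement for the VariationalELBO objective in the sketch above
mll = gpytorch.mlls.PredictiveLogLikelihood(likelihood, model, num_data=train_y.numel())
loss = -mll(model(train_x), train_y)   # minimize the negative objective as before
```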
Generalized Variational Inference (GVI)¶
Generalized Variational Inference - Knoblauch et al. (2019)
A generalized Bayesian inference framework. Instead of the usual KL (Shannon) divergence penalty, it allows alternative divergences such as Rényi's family of information-theoretic divergences, as well as more general loss functions and variational families; the general form of the objective is sketched below. They had success applying it to Bayesian neural networks and deep Gaussian processes.
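Roughly (paraphrasing the paper's "rule of three" and glossing over technical conditions), GVI poses posterior inference as the optimization problem

$$q^{*} = \underset{q \in \Pi}{\arg\min} \; \mathbb{E}_{q(\theta)}\left[ \sum_{i=1}^{N} \ell(\theta, x_i) \right] + D\left( q \,\|\, \pi \right),$$

where standard VI corresponds to choosing the negative log-likelihood for \ell, the KL divergence for D, and the usual variational family for \Pi; robustness comes from swapping in alternative losses or divergences such as Rényi's \alpha-divergence.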
Gradient Descent Regimes¶
Natural Gradients (NGs)¶
Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models - Salimbeni et al. (2018) | Code
This paper argues that training sparse GP models with ordinary gradient descent can be quite slow because the variational parameters of q_\phi(\mathbf{u}) have to be optimized alongside the model hyperparameters. They propose using the natural gradient for the variational parameters and standard gradient methods for the remaining parameters, and show that SVGP and deep GP models converge much faster under this training regime (a two-optimizer sketch follows the links below). I imagine this would also be super useful for the BayesianGPLVM, where we have variational parameters for the inputs as well.
-> Blog - Agustinus Kristiadi
-> Lecture 3.5 Natural Gradient Optimization (I) | Neural Networks | MLCV 2017
- Noisy Natural Gradient as Variational Inference - Zhang (2018) - Code
- PyTorch Implementation
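Here is a minimal sketch of this two-optimizer scheme in GPyTorch, assuming an SVGP model like the one in the ELBO sketch above but built with `gpytorch.variational.NaturalVariationalDistribution` (required for the `NGD` optimizer), plus the same `likelihood`, `train_x`, and `train_y`:

```python
import torch
import gpytorch

# Natural-gradient steps on the variational parameters of q(u) ...
variational_optimizer = gpytorch.optim.NGD(
    model.variational_parameters(), num_data=train_y.numel(), lr=0.1
)
# ... and ordinary Adam steps on the kernel/likelihood hyperparameters
hyperparameter_optimizer = torch.optim.Adam(
    list(model.hyperparameters()) + list(likelihood.parameters()), lr=0.01
)

mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.numel())
for _ in range(100):
    variational_optimizer.zero_grad()
    hyperparameter_optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    variational_optimizer.step()       # natural gradient for q(u)
    hyperparameter_optimizer.step()    # standard gradient for everything else
```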
Parallel training of DNNs with Natural Gradient and Parameter Averaging - Povey et al. (2014)
A seemingly drop-in replacement for stochastic gradient descent, shown to improve generalization, stabilize training, and help obtain high-quality uncertainty estimates.
-> Paper
-> Code
-> Blog
Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm - Liu & Wang (2016)
A tractable approach for approximating high-dimensional probability distributions using functional gradient descent in an RKHS. It comes from a connection between the derivative of the KL divergence and Stein's identity; a toy sketch of the particle update follows the link below.
-> Stein's Method Webpage
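As a toy sketch of the SVGD particle update (fixed kernel bandwidth and step size for simplicity; the paper uses a median-heuristic bandwidth and AdaGrad):

```python
import numpy as np

def rbf_kernel(x, h=1.0):
    """Pairwise RBF kernel k(x_j, x_i) and its gradient w.r.t. x_j."""
    diff = x[:, None, :] - x[None, :, :]             # diff[j, i] = x_j - x_i
    k = np.exp(-(diff ** 2).sum(-1) / (2 * h ** 2))  # (n, n)
    grad_k = -diff * k[..., None] / h ** 2           # grad_{x_j} k(x_j, x_i)
    return k, grad_k

def svgd_step(x, score, eps=0.1, h=1.0):
    """One SVGD update (Liu & Wang, 2016) for particles x given the target's score function."""
    n = x.shape[0]
    k, grad_k = rbf_kernel(x, h)
    # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ]
    drift = k.T @ score(x)       # pulls particles toward high-density regions
    repulsion = grad_k.sum(0)    # pushes particles apart (keeps diversity)
    return x + eps * (drift + repulsion) / n

# Toy target: standard 2-D Gaussian, whose score is simply -x
particles = np.random.randn(50, 2) * 3 + 5.0
for _ in range(500):
    particles = svgd_step(particles, score=lambda x: -x)
print(particles.mean(0), particles.std(0))  # means ~ 0, spread ~ 1
```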
Robust, Accurate Stochastic Optimization for Variational Inference - Dhaka et al. (09-2020)
A more robust stochastic variational inference procedure that combines Polyak-Ruppert iterate averaging with MCMC-style convergence diagnostics on the optimization trace (a toy sketch of the averaging follows the links below).
-> Paper
-> Tweet
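For intuition, the averaging part amounts to replacing the last SGD iterate with the mean of the tail iterates once they look stationary. A toy sketch on a quadratic with noisy gradients; the paper's contribution also includes the diagnostics that decide when to start averaging, which is hard-coded here:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(x):
    """Noisy gradient of f(x) = 0.5 * ||x||^2 (stand-in for a stochastic ELBO gradient)."""
    return x + rng.normal(scale=1.0, size=x.shape)

x, lr, tail = np.full(2, 5.0), 0.05, []
for t in range(2000):
    x = x - lr * noisy_grad(x)
    if t >= 1000:              # crude stand-in for a stationarity diagnostic
        tail.append(x.copy())

x_avg = np.mean(tail, axis=0)  # Polyak-Ruppert average of the tail iterates
print(x, x_avg)                # the average has much lower variance than the last iterate
```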
Uncategorized¶
Regularization¶
- Regularized Sparse Gaussian Processes - Meng & Lee (2019) [arxiv]
Imposes a regularization coefficient on the KL term in the sparse GP objective. This addresses the issue where the distribution of the inducing inputs fails to capture the distribution of the training inputs.