Bayesian Inference


Inference Schemes

Source | Deisenroth - Sampling

Advances in VI - Notebook

  • Numerical Integration (low dimension)
  • Bayesian Quadrature
  • Expectation Propagation
  • Conjugate Priors (Gaussian Likelihood w/ GP Prior)
  • Subset Methods (Nystrom)
  • Fast Linear Algebra (Krylov, Fast Transforms, KD-Trees)
  • Variational Methods (Laplace, Mean-Field, Expectation Propagation)
  • Monte Carlo Methods (Gibbs, Metropolis-Hastings, Particle Filter)

Local Methods

Mean Squared Error (MSE)

In the case of regression, we can use the MSE as a loss function. Minimizing the MSE is equivalent to minimizing the negative log-likelihood when the likelihood is Gaussian with a fixed noise variance.
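The link between the MSE and the Gaussian negative log-likelihood can be checked numerically. The sketch below is only illustrative: the toy data, the noise scale `sigma`, and the helper names `mse` and `gaussian_nll` are assumptions introduced here, not part of the original notes.

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error."""
    return np.mean((y - y_pred) ** 2)

def gaussian_nll(y, y_pred, sigma=1.0):
    """Average negative log-likelihood under N(y | y_pred, sigma^2)."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + 0.5 * (y - y_pred) ** 2 / sigma**2)

# Toy regression data (assumed): y = 2x + small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 0.1 * rng.normal(size=50)

# Evaluate both losses for the one-parameter model f(x; s) = s * x.
slopes = np.linspace(0.0, 4.0, 401)
mse_vals = np.array([mse(y, s * x) for s in slopes])
nll_vals = np.array([gaussian_nll(y, s * x, sigma=0.5) for s in slopes])

# The NLL is an affine function of the MSE, so both pick the same slope.
best_mse_slope = slopes[np.argmin(mse_vals)]
best_nll_slope = slopes[np.argmin(nll_vals)]
print(best_mse_slope, best_nll_slope)
```

With a fixed `sigma`, the two curves differ only by a positive scale and an additive constant, which is why they share a minimizer.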


Maximum A Posteriori (MAP)
Loss Function
$$\theta_{\text{MAP}} = \operatorname*{argmin}_{\theta} - \frac{1}{N}\left[\sum_{n=1}^N\log p\left(y_n|f(x_n; \theta)\right) + \log p(\theta)\right]$$
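As a minimal sketch of how the prior term changes the estimate, consider linear regression with a Gaussian likelihood and a zero-mean Gaussian prior on the weights. In that special case the MAP estimate has a closed form (ridge regression). The data, the noise scale `sigma`, and the prior scale `tau` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 20, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 0.5, 1.0  # assumed noise std and prior std
y = X @ w_true + sigma * rng.normal(size=N)

def neg_log_posterior(w):
    """-log p(y | X, w) - log p(w), up to additive constants."""
    nll = 0.5 * np.sum((y - X @ w) ** 2) / sigma**2
    neg_log_prior = 0.5 * np.sum(w**2) / tau**2
    return nll + neg_log_prior

# MLE: ordinary least squares (no prior term).
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP: the Gaussian prior adds (sigma^2 / tau^2) * I to the normal equations.
w_map = np.linalg.solve(X.T @ X + (sigma**2 / tau**2) * np.eye(D), X.T @ y)

print("MLE:", w_mle)
print("MAP:", w_map)
```

The MAP weights are pulled toward the prior mean (zero here), and the pull weakens as `tau` grows or as more data arrive.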

Maximum Likelihood Estimation (MLE)
Loss Function
$$\theta_{\text{MLE}} = \operatorname*{argmin}_{\theta} - \frac{1}{N}\sum_{n=1}^N \log p(y_n| f(x_n;\theta))$$
KL-Divergence (Forward)
$$\text{D}_{\text{KL}}\left[ p_*(x) || p(x;\theta) \right] = \mathbb{E}_{x\sim p_*}\left[ \log \frac{p_*(x)}{p(x;\theta)}\right]$$

This measures the discrepancy between the best distribution for the data, $p_*(x)$, and the parameterized version, $p(x;\theta)$. It is not a true distance because it is not symmetric in its arguments.

The (forward) KL-divergence and maximum likelihood estimation are equivalent. MLE maximizes the likelihood of the data under our estimated distribution, whereas the KL-divergence measures how far the parameterized distribution is from the “true” or “best” distribution of the real data. The two objectives differ only by a term that does not depend on $\theta$, so they share the same minimizer. The MLE objective is therefore a proxy for fitting the estimated distribution to the “real” data distribution, with the expectation replaced by an average over the observed samples.
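The equivalence can be written out in a few lines. This is a sketch that assumes the samples $x_n$ are drawn i.i.d. from $p_*$; the entropy symbol $\mathbb{H}$ is notation introduced here.

```latex
\begin{aligned}
\text{D}_{\text{KL}}\left[ p_*(x) || p(x;\theta) \right]
  &= \mathbb{E}_{x\sim p_*}\left[ \log p_*(x) \right] - \mathbb{E}_{x\sim p_*}\left[ \log p(x;\theta) \right] \\
  &= -\mathbb{H}\left[ p_* \right] - \mathbb{E}_{x\sim p_*}\left[ \log p(x;\theta) \right] \\
  &\approx -\mathbb{H}\left[ p_* \right] - \frac{1}{N}\sum_{n=1}^N \log p(x_n;\theta),
  \qquad x_n \sim p_*
\end{aligned}
```

Since $\mathbb{H}[p_*]$ does not involve $\theta$, minimizing the last line over $\theta$ is the same as maximizing the average log-likelihood.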

Laplace Approximation

This is where we approximate the posterior with a Gaussian distribution $\mathcal{N}(\mu, A^{-1})$.

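  • $\mu = w_{\text{MAP}}$ is a mode (local maximum) of the posterior $p(w|D)$
  • $A = -\nabla\nabla \log \left[ p(D|w)\, p(w) \right]$, evaluated at $w_{\text{MAP}}$ - a very expensive calculation when $w$ is high-dimensional
  • Only captures a single mode and discards the probability mass elsewhere
    • similar to the mode-seeking behavior of the KLD in one direction, $\text{KLD}[q||p]$.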
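The steps above can be sketched on a one-dimensional Bayesian logistic regression, where the mode is found by Newton's method and the Hessian is available analytically. The toy data, the $\mathcal{N}(0, 1)$ prior, and the function names `log_joint` and `grad_and_precision` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): binary labels from a 1D logistic model with weight 1.5.
x = rng.normal(size=30)
y = (rng.uniform(size=30) < 1.0 / (1.0 + np.exp(-1.5 * x))).astype(float)

def log_joint(w):
    """log p(D|w) + log p(w) up to a constant, with prior w ~ N(0, 1)."""
    z = w * x
    log_lik = -np.sum(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))
    return log_lik - 0.5 * w**2

def grad_and_precision(w):
    """First derivative of log_joint and A = minus its second derivative."""
    s = 1.0 / (1.0 + np.exp(-w * x))
    grad = np.sum((y - s) * x) - w
    A = np.sum(s * (1.0 - s) * x**2) + 1.0
    return grad, A

# Newton ascent to find the mode w_MAP.
w_map = 0.0
for _ in range(100):
    grad, A = grad_and_precision(w_map)
    if abs(grad) < 1e-12:
        break
    w_map += grad / A

# Laplace approximation: N(w_map, 1 / A) with A evaluated at the mode.
_, A = grad_and_precision(w_map)
laplace_mean, laplace_var = w_map, 1.0 / A
print(laplace_mean, laplace_var)
```

In one dimension the Hessian is a scalar. For a network with many weights it becomes a large matrix, which is where the cost noted above comes from.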


Variational Inference

Definition: We can find the best approximation within a given family w.r.t. KL-Divergence.

$$\text{D}_{\text{KL}}\left[ q || p \right] = \int_w q(w) \log \frac{q(w)}{p(w|D)}dw$$

Let $q(w)=\mathcal{N}(\mu, S)$ and then we minimize $\text{D}_{\text{KL}}[q||p]$ to find the parameters $\mu, S$.

“Approximate the posterior, not the model” - James Hensman.

We write out the marginal log-likelihood term for our observations, $y$. Because it does not depend on $x$, it is unchanged by taking an expectation over the posterior.

$$\log p(y;\theta) = \mathbb{E}_{x \sim p(x|y;\theta)}\left[ \log p(y;\theta) \right]$$

We can expand this term using Bayes rule: $p(y) = p(x,y) / p(x|y)$.

$$\log p(y;\theta) = \mathbb{E}_{x \sim p(x|y;\theta)}\left[ \log \underbrace{p(x,y;\theta)}_{\text{joint}} - \log \underbrace{p(x|y;\theta)}_{\text{posterior}}\right]$$

where $p(x,y;\theta)$ is the joint distribution function and $p(x|y;\theta)$ is the posterior distribution function.

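We can use a variational distribution, $q(x|y;\phi)$, which will approximate the posterior $p(x|y;\theta)$. This gives the bound

$$\log p(y;\theta) \geq \mathcal{L}_{ELBO}(\theta,\phi)$$

where $\mathcal{L}_{ELBO}$ is the Evidence Lower Bound (ELBO) term. This serves as a lower bound to the true log marginal likelihood. The gap is $\text{D}_{\text{KL}}\left[ q(x|y;\phi) || p(x|y;\theta)\right] \geq 0$, so the bound is tight when $q$ matches the posterior.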

$$\mathcal{L}_{ELBO}(\theta,\phi) = \mathbb{E}_{q(x|y;\phi)}\left[ \log p(x,y;\theta) - \log q(x|y;\phi) \right]$$

We can rewrite this to single out the expectations by factorizing the joint as $p(x,y;\theta) = p(y|x;\theta)\,p(x;\theta)$. This will result in two important quantities.

$$\mathcal{L}_{ELBO}(\theta,\phi) = \underbrace{\mathbb{E}_{q(x|y;\phi)}\left[ \log p(y|x;\theta)\right]}_{\text{Reconstruction}} - \underbrace{\text{D}_{\text{KL}}\left[ q(x|y;\phi) || p(x;\theta)\right]}_{\text{Regularization}}$$
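To make the two terms concrete, here is a minimal sketch on a conjugate Gaussian model where everything is available in closed form: $x \sim \mathcal{N}(0, 1)$, $y|x \sim \mathcal{N}(x, \sigma^2)$, and $q(x) = \mathcal{N}(m, s^2)$. The model, the observation `y`, and the noise variance `sigma2` are assumptions chosen so that the ELBO can be compared against the exact log marginal likelihood.

```python
import numpy as np

sigma2 = 0.5   # assumed likelihood noise variance
y = 1.3        # a single assumed observation

def elbo(m, s2):
    """ELBO for q(x) = N(m, s2) under x ~ N(0, 1), y | x ~ N(x, sigma2)."""
    # Reconstruction term: E_q[log p(y | x)] in closed form.
    recon = -0.5 * np.log(2 * np.pi * sigma2) - ((y - m) ** 2 + s2) / (2 * sigma2)
    # Regularization term: KL[N(m, s2) || N(0, 1)] in closed form.
    kl = 0.5 * (s2 + m**2 - 1.0 - np.log(s2))
    return recon - kl

# Exact quantities, available because the model is conjugate.
log_evidence = -0.5 * np.log(2 * np.pi * (1 + sigma2)) - y**2 / (2 * (1 + sigma2))
post_mean = y / (1 + sigma2)
post_var = sigma2 / (1 + sigma2)

elbo_at_posterior = elbo(post_mean, post_var)  # bound is tight here
elbo_at_prior = elbo(0.0, 1.0)                 # strictly below the evidence
print(log_evidence, elbo_at_posterior, elbo_at_prior)
```

In non-conjugate models the reconstruction term usually has no closed form and is estimated with samples from $q$ instead.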

Sampling Methods

Monte Carlo

We can produce samples that are asymptotically distributed according to the exact posterior by constructing a Markov chain whose stationary distribution is that posterior.

This is done in practice with NNs because training is already stochastic: adding appropriately scaled noise to the SGD updates turns the optimizer into a scalable MCMC sampler.

Here is a visual demonstration of some popular MCMC samplers.

Markov Chain Monte Carlo
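
A minimal sketch of one such chain is random-walk Metropolis, which only needs the target density up to a normalizing constant. The target below is an assumed $\mathcal{N}(3, 1)$ stand-in for a posterior, and the function names, step size, and burn-in length are illustrative choices.

```python
import numpy as np

def log_target(w):
    """Unnormalized log density; an assumed N(3, 1) stand-in for a posterior."""
    return -0.5 * (w - 3.0) ** 2

def random_walk_metropolis(log_p, w0, n_steps, step_size, rng):
    """Sample from exp(log_p) with Gaussian random-walk proposals."""
    samples = np.empty(n_steps)
    w, lp = w0, log_p(w0)
    for i in range(n_steps):
        proposal = w + step_size * rng.normal()
        lp_prop = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(w)).
        if np.log(rng.uniform()) < lp_prop - lp:
            w, lp = proposal, lp_prop
        samples[i] = w
    return samples

rng = np.random.default_rng(0)
chain = random_walk_metropolis(log_target, w0=0.0, n_steps=20000,
                               step_size=2.0, rng=rng)
samples = chain[2000:]  # discard burn-in
print(samples.mean(), samples.std())
```

Because the proposal is symmetric, the acceptance ratio involves only the target density. Consecutive samples are correlated, so the effective sample size is smaller than the chain length.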

Hamiltonian Monte Carlo

Stochastic Gradient Langevin Dynamics
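
A minimal sketch of the idea, assuming a toy model where the exact posterior is known: each update is an SGD-style step on a minibatch estimate of the log-posterior gradient, plus Gaussian noise whose variance equals the step size. The model, the fixed step size `eps`, and the batch size are assumptions made here. SGLD as originally proposed uses a decaying step size, and a fixed one leaves a small bias in the sampled variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: y_n ~ N(theta, 1) with prior theta ~ N(0, 10^2).
N, batch_size = 1000, 100
data = 2.0 + rng.normal(size=N)
prior_var = 100.0

def grad_log_prior(theta):
    return -theta / prior_var

def grad_log_lik(theta, batch):
    return np.sum(batch - theta)

eps = 1e-4  # fixed step size (assumed for simplicity)
n_steps, burn_in = 5000, 1000
theta = 0.0
trace = np.empty(n_steps)
for t in range(n_steps):
    batch = rng.choice(data, size=batch_size, replace=False)
    # Minibatch estimate of the gradient of the log posterior.
    grad = grad_log_prior(theta) + (N / batch_size) * grad_log_lik(theta, batch)
    # SGD-style step plus injected Gaussian noise with variance eps.
    theta = theta + 0.5 * eps * grad + np.sqrt(eps) * rng.normal()
    trace[t] = theta

samples = trace[burn_in:]

# Exact posterior for comparison (conjugate Gaussian model).
post_var = 1.0 / (1.0 / prior_var + N)
post_mean = post_var * np.sum(data)
print(samples.mean(), post_mean, samples.std(), np.sqrt(post_var))
```

The sample mean should sit close to the exact posterior mean, while the sample spread is somewhat inflated by minibatch gradient noise at this step size.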