
Bayesian Inference


Quick Summary

Inference refers to how we find the posterior distribution. In the previous section, we’ve already seen how we end up with a posterior distribution of the form

$$
p(\boldsymbol{\theta}|\mathcal{D},\mathcal{M}) = \frac{1}{Z}\, p(\mathcal{D}|\boldsymbol{\theta},\mathcal{M})\, p(\boldsymbol{\theta})
$$

where $Z=\int p(\mathcal{D}|\boldsymbol{\theta},\mathcal{M})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}$ is the marginal likelihood. The problematic term within this posterior distribution is the marginal likelihood, $Z$, because it involves solving an integral.

In general, integration is a very hard problem, especially in high-dimensional settings. The entire literature of inference revolves around how to compute or approximate this integral. Broadly, the approaches fall into three families:

Conjugate Methods. These methods enable us to solve for this integral exactly. When the likelihood and prior are conjugate, typically Gaussian-like and linear, we can use linear algebra to obtain a closed-form expression for the marginal likelihood (see the sketch after this list).

Local Methods. These methods are local approximations whereby we look for a single mode of the (potentially) multi-modal posterior distribution. They convert the integration problem into an optimization problem. This family of methods includes MAP estimation, the Laplace approximation, variational inference, and expectation propagation.

Sampling. These methods try to actually sample from the posterior exactly. If the problem is low-dimensional, we can use numerical integration directly, e.g., the trapezoidal rule, quadrature, or Bayesian quadrature. However, often the problem is not low-dimensional, so there are smarter sampling methods such as Monte Carlo (MC), Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo (HMC), and particle filters.
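
As a concrete illustration of the conjugate case, here is a minimal sketch, assuming a hypothetical model with a Gaussian likelihood of known variance and a conjugate Gaussian prior on the mean; all numerical values are chosen for illustration. The posterior is available in closed form, so no numerical integration is needed.

```python
import numpy as np

# Hypothetical setup: y_n ~ N(theta, sigma^2) with known sigma,
# and a conjugate Gaussian prior theta ~ N(mu0, tau0^2).
rng = np.random.default_rng(0)
sigma, mu0, tau0 = 0.5, 0.0, 1.0
y = rng.normal(1.0, sigma, size=20)
n, ybar = y.size, y.mean()

# Closed-form posterior: a precision-weighted combination of prior and data.
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + n * ybar / sigma**2)

# For this model the marginal likelihood Z is also available in closed form,
# which is exactly what makes the conjugate setting so convenient.
print(f"posterior: N({post_mean:.3f}, {post_var:.3f})")
```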


Local Methods

Mean Squared Error (MSE)

In the case of regression, we can use the MSE as a loss function. For a Gaussian likelihood with fixed noise variance, minimizing the MSE is equivalent to minimizing the negative log-likelihood term above.
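
A minimal sketch of this equivalence, assuming a hypothetical linear model with unit noise variance: the Gaussian negative log-likelihood and the MSE differ only by constants and a positive scale factor, so they share the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=1.0, size=100)  # noise sigma = 1 (assumed)

def mse(theta):
    return np.mean((y - theta * x) ** 2)

def gaussian_nll(theta, sigma=1.0):
    # Average Gaussian negative log-likelihood.
    resid = y - theta * x
    return np.mean(0.5 * (resid / sigma) ** 2 + 0.5 * np.log(2 * np.pi * sigma**2))

thetas = np.linspace(0.0, 4.0, 401)
print("argmin MSE:", thetas[np.argmin([mse(t) for t in thetas])])
print("argmin NLL:", thetas[np.argmin([gaussian_nll(t) for t in thetas])])
# Both print the same theta: NLL = MSE / (2 sigma^2) + const.
```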



Maximum A Posteriori (MAP)

Loss Function
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmin}_{\boldsymbol{\theta}} - \frac{1}{N}\sum_{n=1}^N\log p\left(y_n|f(x_n; \boldsymbol{\theta})\right) - \log p(\boldsymbol{\theta})
$$

Maximum Likelihood Estimation (MLE)

Loss Function
$$
\boldsymbol{\theta}_{\text{MLE}} = \operatorname*{argmin}_{\boldsymbol{\theta}} - \frac{1}{N}\sum_{n=1}^N \log p(y_n| f(x_n;\boldsymbol{\theta}))
$$
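
A minimal sketch contrasting the two estimators, assuming the same hypothetical linear-Gaussian model as above with a Gaussian prior on the slope: MAP simply adds the log-prior penalty to the MLE objective, which shrinks the estimate toward the prior mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(scale=1.0, size=30)

def nll(theta, sigma=1.0):
    # Average Gaussian negative log-likelihood (constants dropped).
    return np.mean(0.5 * ((y - theta * x) / sigma) ** 2)

def neg_log_prior(theta, tau=0.5):
    # Gaussian prior theta ~ N(0, tau^2), assumed for illustration.
    return 0.5 * (theta / tau) ** 2

theta_mle = minimize_scalar(nll).x
theta_map = minimize_scalar(lambda t: nll(t) + neg_log_prior(t)).x
print(f"MLE: {theta_mle:.3f}, MAP: {theta_map:.3f} (shrunk toward 0)")
```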

KL-Divergence (Forward)

$$
\text{D}_{\text{KL}}\left[ p_*(x) \,\|\, p(x;\boldsymbol{\theta}) \right] = \mathbb{E}_{x\sim p_*}\left[ \log \frac{p_*(x)}{p(x;\boldsymbol{\theta})}\right]
$$

This measures the divergence between the best distribution for the data, $p_*(x)$, and the parameterized version, $p(x;\boldsymbol{\theta})$.

There is an equivalence between the (forward) KL-divergence and maximum likelihood estimation. The MLE formulation maximizes the likelihood of the data under our estimated distribution, whereas the KL-divergence measures the discrepancy between the parameterized distribution and the "true" or "best" distribution of the real data. They are equivalent formulations, but the MLE equation shows that fitting the data is a proxy for matching the estimated distribution function to the "real" data distribution.
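
To make the equivalence explicit, we can expand the divergence and drop the entropy term of $p_*$, which does not depend on $\boldsymbol{\theta}$:

$$
\operatorname*{argmin}_{\boldsymbol{\theta}} \text{D}_{\text{KL}}\left[ p_* \,\|\, p(\cdot;\boldsymbol{\theta}) \right]
= \operatorname*{argmin}_{\boldsymbol{\theta}} \left( \mathbb{E}_{x\sim p_*}\left[\log p_*(x)\right] - \mathbb{E}_{x\sim p_*}\left[\log p(x;\boldsymbol{\theta})\right] \right)
= \operatorname*{argmax}_{\boldsymbol{\theta}} \mathbb{E}_{x\sim p_*}\left[\log p(x;\boldsymbol{\theta})\right]
$$

Replacing the expectation with the empirical average over samples $x_n \sim p_*$ recovers the MLE objective.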


Laplace Approximation

This is where we approximate the posterior with a Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, A^{-1})$.

  • $\boldsymbol{\mu} = w_{\text{MAP}}$, a mode (local maximum) of $p(w|\mathcal{D})$
  • $A = -\nabla\nabla \log p(\mathcal{D}|w)\,p(w)\big|_{w=w_{\text{MAP}}}$, the negative Hessian of the log joint at the mode - a very expensive calculation in high dimensions
  • Only captures a single mode and discards the rest of the probability mass
    • similar to minimizing the KL-divergence in one direction (see the sketch after this list)
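
Here is a minimal 1D sketch of the Laplace approximation, assuming a hypothetical unnormalized log-posterior chosen purely for illustration: find the mode numerically, then take the curvature there as the Gaussian precision.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical unnormalized log-posterior (log joint).
def log_joint(w):
    return -0.5 * (w - 1.0) ** 2 - 0.1 * w ** 4

# Step 1: find the mode w_MAP by maximizing the log joint.
w_map = minimize_scalar(lambda w: -log_joint(w)).x

# Step 2: precision A = -d^2/dw^2 log_joint at the mode (finite differences).
h = 1e-4
A = -(log_joint(w_map + h) - 2 * log_joint(w_map) + log_joint(w_map - h)) / h**2

print(f"Laplace approximation: N(mu={w_map:.3f}, var={1.0 / A:.3f})")
```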



Variational Inference

Definition: We find the best approximation to the posterior within a given family of distributions, as measured by the KL-divergence.

$$
\text{KLD}[q\|p] = \int q(w) \log \frac{q(w)}{p(w|\mathcal{D})}\,dw
$$

Let $q(w)=\mathcal{N}(\boldsymbol{\mu}, S)$ and then we minimize $\text{KLD}[q\|p]$ to find the parameters $\boldsymbol{\mu}, S$.

“Approximate the posterior, not the model” - James Hensman.

We write out the marginal log-likelihood term for our observations, yy.

$$
\log p(y;\theta) = \mathbb{E}_{x \sim p(x|y;\theta)}\left[ \log p(y;\theta) \right]
$$

This holds because $\log p(y;\theta)$ does not depend on $x$, so we may wrap it in an expectation over the posterior $p(x|y;\theta)$.

We can expand this term using Bayes' rule: $p(y) = \frac{p(x,y)}{p(x|y)}$.

$$
\log p(y;\theta) = \mathbb{E}_{x \sim p(x|y;\theta)}\left[ \log \underbrace{p(x,y;\theta)}_{\text{joint}} - \log \underbrace{p(x|y;\theta)}_{\text{posterior}}\right]
$$

where p(x,y;θ)p(x,y;\theta) is the joint distribution function and p(xy;θ)p(x|y;\theta) is the posterior distribution function.

We can use a variational distribution, $q(x|y;\phi)$, which will approximate the true posterior $p(x|y;\theta)$. This yields a bound on the marginal log-likelihood:

$$
\log p(y;\theta) \geq \mathcal{L}_{\text{ELBO}}(\theta,\phi)
$$

where $\mathcal{L}_{\text{ELBO}}$ is the Evidence Lower Bound (ELBO) term. This serves as a lower bound to the true marginal log-likelihood.

$$
\mathcal{L}_{\text{ELBO}}(\theta,\phi) = \mathbb{E}_{q(x|y;\phi)}\left[ \log p(x,y;\theta) - \log q(x|y;\phi) \right]
$$

We can rewrite this to separate the expectation into two important quantities.

$$
\mathcal{L}_{\text{ELBO}}(\theta,\phi) = \underbrace{\mathbb{E}_{q(x|y;\phi)}\left[ \log p(y|x;\theta)\right]}_{\text{Reconstruction}} - \underbrace{\text{D}_{\text{KL}}\left[ q(x|y;\phi) \,\|\, p(x;\theta)\right]}_{\text{Regularization}}
$$
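
A minimal sketch of this idea, assuming a hypothetical one-dimensional Gaussian model so the true posterior is known for comparison: we estimate the ELBO by Monte Carlo for a Gaussian $q$ and pick the variational parameters that maximize it (a crude grid search stands in for gradient-based optimization).

```python
import numpy as np
from scipy import stats

# Hypothetical model: x ~ N(0, 1), y | x ~ N(x, 0.5^2); we observe y = 1.
y_obs, sigma = 1.0, 0.5
rng = np.random.default_rng(0)

def elbo(m, s, n_samples=2000):
    # Monte Carlo estimate of E_q[log p(x, y) - log q(x)] with q = N(m, s^2).
    x = rng.normal(m, s, size=n_samples)
    log_joint = stats.norm.logpdf(x, 0, 1) + stats.norm.logpdf(y_obs, x, sigma)
    log_q = stats.norm.logpdf(x, m, s)
    return np.mean(log_joint - log_q)

# Grid search over the variational parameters (mu, S).
grid_m = np.linspace(0.0, 1.5, 31)
grid_s = np.linspace(0.1, 1.0, 19)
m_best, s_best = max(((m, s) for m in grid_m for s in grid_s),
                     key=lambda ms: elbo(*ms))

# Exact posterior for this conjugate model, for comparison.
post_var = 1 / (1 + 1 / sigma**2)
print(f"VI: N({m_best:.2f}, {s_best**2:.3f}) vs "
      f"exact: N({post_var * y_obs / sigma**2:.2f}, {post_var:.3f})")
```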

Sampling Methods

Monte Carlo

We can produce samples from the exact posterior by constructing a Markov chain whose stationary distribution is the posterior.
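
A minimal random-walk Metropolis-Hastings sketch, assuming the same hypothetical unnormalized log-posterior used in the Laplace example: only ratios of densities appear in the acceptance step, so the intractable constant $Z$ never needs to be computed.

```python
import numpy as np

# Hypothetical unnormalized log-posterior (illustrative target).
def log_post(w):
    return -0.5 * (w - 1.0) ** 2 - 0.1 * w ** 4

rng = np.random.default_rng(0)
w, samples = 0.0, []
for _ in range(10_000):
    w_new = w + rng.normal(scale=0.5)          # symmetric random-walk proposal
    log_alpha = log_post(w_new) - log_post(w)  # acceptance ratio: Z cancels
    if np.log(rng.uniform()) < log_alpha:
        w = w_new
    samples.append(w)

samples = np.array(samples[2_000:])            # discard burn-in
print(f"posterior mean ~ {samples.mean():.3f}, sd ~ {samples.std():.3f}")
```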

We actually do this in practice with neural networks because of their stochastic training regime: the SGD algorithm can be modified to define a scalable MCMC sampler.
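
For instance, here is a minimal stochastic gradient Langevin dynamics (SGLD) sketch, assuming the same hypothetical target as above with its gradient available analytically: each gradient step is perturbed with Gaussian noise whose variance matches the step size.

```python
import numpy as np

# Gradient of the hypothetical log-posterior used above.
def grad_log_post(w):
    return -(w - 1.0) - 0.4 * w ** 3

rng = np.random.default_rng(0)
eps, w, samples = 1e-2, 0.0, []
for _ in range(20_000):
    # SGLD update: half-step gradient ascent plus injected noise N(0, eps).
    w = w + 0.5 * eps * grad_log_post(w) + rng.normal(scale=np.sqrt(eps))
    samples.append(w)

samples = np.array(samples[5_000:])  # discard burn-in
print(f"posterior mean ~ {samples.mean():.3f}")
```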

Here is a visual demonstration of some popular MCMC samplers:

  • Markov Chain Monte Carlo
  • Hamiltonian Monte Carlo
  • Stochastic Gradient Langevin Dynamics