Inference Schemes#

Source | Deisenroth - Sampling

Advances in VI - Notebook

  • Numerical Integration (low dimension)

  • Bayesian Quadrature

  • Expectation Propagation

  • Conjugate Priors (Gaussian Likelihood w/ GP Prior)

  • Subset Methods (Nyström)

  • Fast Linear Algebra (Krylov, Fast Transforms, KD-Trees)

  • Variational Methods (Laplace, Mean-Field, Expectation Propagation)

  • Monte Carlo Methods (Gibbs, Metropolis-Hastings, Particle Filter)

Local Methods

Sampling Methods


Local Methods#

Mean Squared Error (MSE)#

In the case of regression, we can use the MSE as a loss function. Under a Gaussian likelihood with fixed noise variance, minimizing the MSE recovers exactly the negative log-likelihood term above (a short check follows).
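To see the equivalence, assume i.i.d. Gaussian noise with a fixed variance \(\sigma^2\) (the standard regression setting):

\[ -\log \mathcal{N}\left(y_n | f(x_n;\theta), \sigma^2\right) = \frac{1}{2\sigma^2}\left(y_n - f(x_n;\theta)\right)^2 + \text{const} \]

so minimizing the MSE, \(\frac{1}{N}\sum_n \left(y_n - f(x_n;\theta)\right)^2\), yields the same \(\theta\) as minimizing the negative log-likelihood.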


Maximum A Posteriori (MAP)#

Loss Function#

\[ \boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmin}_{\boldsymbol{\theta}} - \frac{1}{N}\sum_n^N\log p\left(y_n|f(x_n; \theta)\right) - \log p(\theta) \]
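A minimal sketch of this objective for linear regression with a Gaussian likelihood and a zero-mean Gaussian prior (so the negative log prior becomes an L2 penalty). The toy data, noise scale \(\sigma\), and prior scale \(\alpha\) are illustrative assumptions, not from the source.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy data: y = 2x + noise (illustrative assumption).
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

sigma = 0.1   # assumed (known) observation noise std
alpha = 1.0   # assumed prior std on the weights

def neg_log_posterior(theta):
    # Negative average log-likelihood (matching the 1/N in the objective above).
    resid = y - X @ theta
    nll = 0.5 * np.mean(resid**2) / sigma**2
    # Minus the log prior: zero-mean Gaussian => L2 penalty.
    neg_log_prior = 0.5 * np.sum(theta**2) / alpha**2
    return nll + neg_log_prior

theta_map = minimize(neg_log_posterior, x0=np.zeros(1)).x
print(theta_map)  # close to [2.0]
```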

Maximum Likelihood Estimation (MLE)#

Loss Function#

\[ \theta_{\text{MLE}} = \operatorname*{argmin}_{\theta} - \frac{1}{N}\sum_n^N \log p(y_n| f(x_n;\theta)) \]

Remark 1

Intuitively, this objective can admit many solutions that minimize it equally well. Worse, for flexible models there can be many local minima that an optimizer gets stuck in; a toy example of direct minimization is sketched below.
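A minimal sketch (my own illustrative example, not from the source) of fitting a Gaussian's mean and variance by directly minimizing the average negative log-likelihood. This particular toy problem is well behaved, but the same objective for a neural network typically has many local minima.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=1000)  # toy data

def nll(params):
    mu, log_var = params
    # Average Gaussian negative log-likelihood (constants dropped).
    return 0.5 * log_var + 0.5 * np.mean((y - mu) ** 2) / np.exp(log_var)

mle = minimize(nll, x0=np.zeros(2)).x
print(mle[0], np.exp(0.5 * mle[1]))  # approx 3.0 and 2.0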

KL-Divergence (Forward)#

\[ \text{D}_{\text{KL}}\left[ p_*(x) || p(x;\theta) \right] = \mathbb{E}_{x\sim p_*}\left[ \log \frac{p_*(x)}{p(x;\theta)}\right] \]

This measures the discrepancy between the true (or best-fitting) data distribution, \(p_*(x)\), and the parameterized version, \(p(x;\theta)\).

There is an equivalence between the (forward) KL-divergence and maximum likelihood estimation. MLE maximizes the likelihood of the observed data under our estimated distribution, whereas the forward KL-divergence measures the discrepancy between the parameterized distribution and the "true" or "best" distribution of the real data. The two formulations are equivalent up to a constant, which shows that MLE is a proxy for fitting the estimated distribution to the real data distribution; the short derivation below makes this explicit.
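Expanding the forward KL-divergence and separating the entropy term, which does not depend on \(\theta\):

\[ \text{D}_{\text{KL}}\left[ p_*(x) || p(x;\theta) \right] = \underbrace{\mathbb{E}_{x\sim p_*}\left[ \log p_*(x) \right]}_{\text{const. w.r.t. } \theta} - \mathbb{E}_{x\sim p_*}\left[ \log p(x;\theta) \right] \]

So \(\operatorname*{argmin}_{\theta} \text{D}_{\text{KL}}[p_* || p_\theta] = \operatorname*{argmax}_{\theta} \mathbb{E}_{x\sim p_*}[\log p(x;\theta)]\), and approximating the expectation with the empirical data distribution recovers the MLE objective above.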


Laplace Approximation#

This is where we approximate the posterior with a Gaussian distribution \(\mathcal{N}(\mu, A^{-1})\) centered at a mode of the posterior (a worked sketch follows the list below).

  • \(w=w_{map}\), finds a mode (local max) of \(p(w|D)\)

  • \(A = -\nabla\nabla \log p(D|w)p(w)\big|_{w=w_{\text{MAP}}}\), the negative Hessian of the log joint - a very expensive calculation in high dimensions

  • Only captures a single mode and discards probability mass elsewhere

    • similar in spirit to the mode-seeking behavior of the KL-divergence in one direction.
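A minimal sketch for Bayesian logistic regression, where \(A\) has the closed form \(X^\top \text{diag}(p(1-p))X + \alpha I\); the toy data and prior precision \(\alpha\) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy binary classification data (illustrative assumption).
X = rng.normal(size=(200, 2))
t = (X @ np.array([1.5, -2.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)

alpha = 1.0  # assumed prior precision: p(w) = N(0, I/alpha)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def neg_log_joint(w):
    p = sigmoid(X @ w)
    eps = 1e-12  # numerical safety inside the logs
    log_lik = np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
    log_prior = -0.5 * alpha * w @ w
    return -(log_lik + log_prior)

# Step 1: find the mode w_MAP of p(w|D).
w_map = minimize(neg_log_joint, x0=np.zeros(2)).x

# Step 2: A = -grad grad log p(D|w)p(w) at w_MAP (closed form for this model).
p = sigmoid(X @ w_map)
A = X.T @ (X * (p * (1 - p))[:, None]) + alpha * np.eye(2)

# Laplace approximation: p(w|D) ~= N(w_MAP, A^{-1}).
cov = np.linalg.inv(A)
print(w_map, cov)
```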


Variational Inference#

Definition: We can find the best approximation within a given family w.r.t. the KL-Divergence.

\[ \text{KLD}[q||p] = \int_w q(w) \log \frac{q(w)}{p(w|D)}dw \]

Let \(q(w)=\mathcal{N}(\mu, S)\) and then we minimize \(\text{KLD}(q||p)\) to find the parameters \(\mu, S\).

“Approximate the posterior, not the model” - James Hensman.

We write out the marginal log-likelihood term for our observations, \(y\).

\[ \log p(y;\theta) = \mathbb{E}_{x \sim p(x|y;\theta)}\left[ \log p(y;\theta) \right] \]

We can expand this term using Bayes' rule: \(p(y) = \frac{p(x,y)}{p(x|y)}\).

\[ \log p(y;\theta) = \mathbb{E}_{x \sim p(x|y;\theta)}\left[ \log \underbrace{p(x,y;\theta)}_{\text{joint}} - \log \underbrace{p(x|y;\theta)}_{\text{posterior}}\right] \]

where \(p(x,y;\theta)\) is the joint distribution function and \(p(x|y;\theta)\) is the posterior distribution function.

We can use a variational distribution, \(q(x|y;\phi)\), which will approximate the intractable posterior \(p(x|y;\theta)\). This yields a bound on the marginal log-likelihood:

\[ \log p(y;\theta) \geq \mathcal{L}_{ELBO}(\theta,\phi) \]

where \(\mathcal{L}_{ELBO}\) is the Evidence Lower Bound (ELBO) term. This serves as a lower bound on the true marginal log-likelihood.

\[ \mathcal{L}_{ELBO}(\theta,\phi) = \mathbb{E}_{q(x|y;\phi)}\left[ \log p(x,y;\theta) - \log q(x|y;\phi) \right] \]

We can rewrite this to separate out the terms, which results in two important quantities.

\[ \mathcal{L}_{ELBO}(\theta,\phi) = \underbrace{\mathbb{E}_{q(x|y;\phi)}\left[ \log p(y|x;\theta)\right]}_{\text{Reconstruction}} - \underbrace{\text{D}_{\text{KL}}\left[ q(x|y;\phi) || p(x;\theta)\right]}_{\text{Regularization}} \]
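To make the two quantities concrete: a minimal sketch, assuming a conjugate toy model (prior \(x \sim \mathcal{N}(0,1)\), likelihood \(y_n \sim \mathcal{N}(x, \sigma^2)\)) where both ELBO terms are available in closed form for a Gaussian \(q(x) = \mathcal{N}(\mu, s^2)\).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
sigma = 0.5                                    # assumed likelihood noise std
y = rng.normal(loc=1.0, scale=sigma, size=50)  # toy observations

def neg_elbo(params):
    mu, log_s = params
    s2 = np.exp(2 * log_s)
    # Reconstruction: E_q[log p(y|x)] for Gaussian q, constants dropped.
    recon = -0.5 * np.sum((y - mu) ** 2 + s2) / sigma**2
    # Regularization: KL[ N(mu, s^2) || N(0, 1) ] in closed form.
    kl = 0.5 * (mu**2 + s2 - 1.0 - np.log(s2))
    return -(recon - kl)

mu, log_s = minimize(neg_elbo, x0=np.zeros(2)).x
print(mu, np.exp(log_s))  # matches the exact conjugate posterior
```

Because the variational family contains the exact posterior here, maximizing the ELBO recovers it; for non-conjugate models the expectations are instead estimated by Monte Carlo, e.g. with the reparameterization trick.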

Sampling Methods#

Monte Carlo#

We can produce samples from the exact posterior by constructing a Markov chain whose stationary distribution is the posterior (Markov chain Monte Carlo, MCMC).

We can even do this in practice with neural networks by exploiting their stochastic training regime: modifying the SGD update yields a scalable MCMC sampler (see Stochastic Gradient Langevin Dynamics below).

Here is a visual demonstration of some popular MCMC samplers.
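As a minimal concrete example, a random-walk Metropolis-Hastings sampler; the bimodal 1D target and step size are illustrative assumptions, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalized log-density: a two-component Gaussian mixture (assumed).
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def metropolis_hastings(n_samples, step=1.0, x0=0.0):
    x = x0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step * rng.normal()      # symmetric random-walk proposal
        log_accept = log_target(proposal) - log_target(x)
        if np.log(rng.uniform()) < log_accept:  # accept/reject step
            x = proposal
        samples[i] = x
    return samples

samples = metropolis_hastings(10_000)
print(samples.mean(), samples.std())
```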

Hamiltonian Monte Carlo#
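HMC augments the parameter state with an auxiliary momentum variable and simulates Hamiltonian dynamics with a leapfrog integrator, followed by a Metropolis accept/reject correction. A minimal sketch, assuming a standard-Gaussian target and fixed step size and leapfrog count:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):        # assumed target: standard 2D Gaussian
    return -0.5 * np.sum(x**2)

def grad_log_p(x):
    return -x

def hmc_step(x, eps=0.1, n_leapfrog=20):
    p = rng.normal(size=x.shape)                # resample momentum
    x_new, p_new = x.copy(), p.copy()
    # Leapfrog integration of the Hamiltonian dynamics.
    p_new += 0.5 * eps * grad_log_p(x_new)
    for _ in range(n_leapfrog - 1):
        x_new += eps * p_new
        p_new += eps * grad_log_p(x_new)
    x_new += eps * p_new
    p_new += 0.5 * eps * grad_log_p(x_new)
    # Metropolis correction on the joint (position, momentum) energy.
    log_accept = (log_p(x_new) - 0.5 * p_new @ p_new) - (log_p(x) - 0.5 * p @ p)
    return x_new if np.log(rng.uniform()) < log_accept else x

x = np.zeros(2)
samples = []
for _ in range(5_000):
    x = hmc_step(x)
    samples.append(x)
print(np.mean(samples, axis=0), np.std(samples, axis=0))  # approx 0 and 1
```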

Stochastic Gradient Langevin Dynamics#
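SGLD turns SGD into a sampler by injecting Gaussian noise whose variance matches the step size, so the iterates approximately sample from the posterior. A minimal sketch, assuming a standard-Gaussian log-posterior, full-batch gradients, and a constant step size (in practice the step size is annealed and minibatch gradients are rescaled by the dataset size):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_post(w):
    # Assumed toy log-posterior: standard Gaussian, so the gradient is -w.
    return -w

w = np.zeros(2)
eps = 0.01          # step size (kept constant here for simplicity)
samples = []
for _ in range(20_000):
    noise = rng.normal(size=w.shape)
    # SGD step on the log-posterior plus injected Gaussian noise of variance eps.
    w = w + 0.5 * eps * grad_log_post(w) + np.sqrt(eps) * noise
    samples.append(w)
print(np.mean(samples, axis=0), np.std(samples, axis=0))  # approx 0 and 1
```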