## Quick Summary

Inference refers to how we find the posterior distribution.
In the previous section, we’ve already seen how we end up with a posterior distribution of the form
$$
p(\boldsymbol{\theta}|\mathcal{D},\mathcal{M}) = \frac{1}{Z}\, p(\mathcal{D}|\boldsymbol{\theta},\mathcal{M})\, p(\boldsymbol{\theta})
$$

where $Z = \int p(\mathcal{D}|\boldsymbol{\theta},\mathcal{M})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}$ is the marginal likelihood.

The problematic term in this posterior distribution is the marginal likelihood, $Z$, because it involves solving an integral.
In general, integration is a very hard problem, especially in high-dimensional settings.
The entire literature on inference revolves around how to deal with this integral. Broadly, there are three families of approaches:

1. **Conjugate Methods**. These enable us to solve the integral exactly. Typically, when things are Gaussian and linear, we can use linear algebra to obtain a closed-form expression for the marginal likelihood.
2. **Local Methods**. These are local approximations whereby we look for a single mode of the (potentially) multi-modal posterior distribution. They convert the integration problem into an optimization problem. This family of methods includes MAP estimation, the Laplace approximation, variational inference, and expectation propagation.
3. **Sampling**. These methods try to sample from the posterior exactly. If the problem is low-dimensional, we can use numerical integration directly, e.g. the trapezoidal rule, quadrature, or Bayesian quadrature (see the short sketch after this list). However, the problem often is not, so there are smarter sampling methods such as Monte Carlo, Metropolis-Hastings, Gibbs sampling, Markov chain Monte Carlo (MCMC), Hamiltonian Monte Carlo (HMC), and particle filters.
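To make the low-dimensional case concrete, here is a minimal sketch (a made-up one-dimensional Gaussian model, not from the text) of computing $Z$ by direct numerical integration with the trapezoidal rule:

```python
import numpy as np

# Hypothetical 1D example: y_i ~ N(theta, 1) with a N(0, 1) prior on theta.
# Because theta is one-dimensional, we can compute Z = \int p(D|theta) p(theta) dtheta
# on a grid with the trapezoidal rule.
data = np.array([0.2, 0.5, -0.1, 0.4])

theta = np.linspace(-5.0, 5.0, 2001)
log_prior = -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)
log_lik = np.array([np.sum(-0.5 * (data - t) ** 2 - 0.5 * np.log(2 * np.pi))
                    for t in theta])

integrand = np.exp(log_lik + log_prior)
Z = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(theta))  # trapezoidal rule

posterior = integrand / Z   # normalized posterior on the grid
print(Z, np.sum(0.5 * (posterior[1:] + posterior[:-1]) * np.diff(theta)))  # Z, then ~1.0
```

This brute-force approach becomes infeasible as soon as the parameter space has more than a handful of dimensions, which is why the methods below exist.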
## Local Methods

### Mean Squared Error (MSE)

In the case of regression, we can use the MSE as a loss function. Under a Gaussian noise assumption, minimizing the MSE is exactly equivalent to minimizing the negative log-likelihood, as shown below.
Proof
The likelihood of our model is:
$$
\log p(y|\mathbf{X},\mathbf{w}) = \sum_{i=1}^N \log p(y_i|\mathbf{x}_i, \boldsymbol{\theta})
$$

For simplicity, we assume the noise $\varepsilon$ comes from a Gaussian distribution with constant variance, so we can rewrite our likelihood as
$$
\log p(y|\mathbf{X},\mathbf{w}) = \sum_{i=1}^N \log \mathcal{N}(y_i \,|\, \mathbf{x}_i\mathbf{w}, \sigma_e^2)
$$

Plugging in the full formula for the Gaussian distribution and simplifying gives us:
$$
\log p(y|\mathbf{X},\mathbf{w}) = \sum_{i=1}^N \log \frac{1}{\sqrt{2 \pi \sigma_e^2}} \exp\left( - \frac{(y_i - \mathbf{x}_i\mathbf{w})^2}{2\sigma_e^2} \right)
$$

We can use the log rule $\log ab = \log a + \log b$ to separate the constant term from the exponential. Also, $\log e^x = x$.
$$
\log p(y|\mathbf{X},\mathbf{w}) = - \frac{N}{2} \log 2 \pi \sigma_e^2 - \sum_{i=1}^N \frac{(y_i - \mathbf{x}_i\mathbf{w})^2}{2\sigma_e^2}
$$

The first term is constant, so we can ignore it in our loss function. We can do the same for the constant denominator $2\sigma_e^2$ in the second term. Up to constants that do not affect the optimum, this simplifies to
$$
\log p(y|\mathbf{X},\mathbf{w}) = - \sum_{i=1}^N (y_i - \mathbf{x}_i\mathbf{w})^2
$$

We want to maximize this quantity: in other words, we want to find the parameters $\mathbf{w}$ for which this expression is largest.
$$
\mathbf{w}_{MLE} = \operatorname*{argmax}_{\mathbf{w}}\; - \sum_{i=1}^N (y_i - \mathbf{x}_i\mathbf{w})^2
$$

We can rewrite this as a minimization, because maximizing a negative quantity is the same as minimizing the corresponding positive quantity.
$$
\mathbf{w}_{MLE} = \operatorname*{argmin}_{\mathbf{w}} \frac{1}{N} \sum_{i=1}^N (y_i - \mathbf{x}_i\mathbf{w})^2
$$

This is the same as the MSE expression, with the addition of the scalar factor $1/N$ (which does not change the minimizer).
$$
\begin{aligned}
\mathbf{w}_{MLE} &= \operatorname*{argmin}_{\mathbf{w}} \frac{1}{N} \sum_{i=1}^N (y_i - \mathbf{x}_i\mathbf{w})^2 \\
&= \operatorname*{argmin}_{\mathbf{w}} \text{MSE}
\end{aligned}
$$

**Note**: If we did not know $\sigma_e^2$, then we would have to optimize over it as well.
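As a quick numerical check, here is a minimal sketch (synthetic data, not from the text) showing that the least-squares / MSE solution coincides with the maximum-likelihood solution for a linear model with Gaussian noise:

```python
import numpy as np

# Synthetic linear-Gaussian data: y = Xw + noise
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
sigma_e = 0.1
y = X @ w_true + sigma_e * rng.normal(size=N)

# Least-squares solution, i.e. the minimizer of the MSE
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

def mse(w):
    return np.mean((y - X @ w) ** 2)

def neg_log_likelihood(w, sigma=sigma_e):
    # Gaussian negative log-likelihood (constant term included for completeness)
    return 0.5 * N * np.log(2 * np.pi * sigma**2) + np.sum((y - X @ w) ** 2) / (2 * sigma**2)

# Both objectives are minimized by the same weights
print(w_mle)                                   # close to w_true
print(mse(w_mle), neg_log_likelihood(w_mle))   # both minimized at the same w
```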
### Maximum A Posteriori (MAP)

#### Loss Function

$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; - \frac{1}{N}\sum_n^N \log p\left(y_n|f(x_n; \boldsymbol{\theta})\right) - \log p(\boldsymbol{\theta})
$$

Proof
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \log p(\boldsymbol{\theta}|\mathcal{D})
$$

We can plug in the base Bayesian formulation:
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \log \left[ \frac{p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{p(\mathcal{D})} \right]
$$

We can expand this term using the log rules:
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \left[ \log p(\mathcal{D}|\boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) - \log p(\mathcal{D}) \right]
$$

Notice that $\log p(\mathcal{D})$ is a constant, since the distribution of the data does not change and it does not depend on the parameters $\boldsymbol{\theta}$. So we can drop that term.
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \left[ \log p(\mathcal{D}|\boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) \right]
$$

We can turn this maximization into a minimization by negating the objective:
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; - \log p(\mathcal{D}|\boldsymbol{\theta}) - \log p(\boldsymbol{\theta})
$$

We cannot evaluate the likelihood term over the full data distribution, regardless of what it is conditioned on, so we express it as an expectation over the data.
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; - \mathbb{E}_{\mathbf{x}\sim P_X} \left[ \log p(\mathcal{D}|\boldsymbol{\theta})\right] - \log p(\boldsymbol{\theta})
$$

We can approximate this expectation using Monte Carlo samples:
$$
\mathbb{E}_{\mathbf{x}}\left[\log p(\mathcal{D}|\boldsymbol{\theta})\right] \approx \frac{1}{N}\sum_n^N \log p(y_n | f(x_n; \boldsymbol{\theta}))
$$

and we assume that, with enough samples, this captures the essence of our data.
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; - \frac{1}{N}\sum_n^N \log p(y_n| f(x_n;\boldsymbol{\theta})) - \log p(\boldsymbol{\theta})
$$
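To make the MAP objective concrete, here is a minimal sketch (an assumed linear-Gaussian model, not from the text): with a Gaussian likelihood and a zero-mean Gaussian prior on the weights, the objective above reduces to L2-regularized (ridge) least squares:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, 0.0, -1.0])
sigma_e, alpha = 0.5, 1.0       # noise std and prior precision, p(w) = N(0, (1/alpha) I)
y = X @ w_true + sigma_e * rng.normal(size=50)

# argmin_w  (1 / (2 sigma^2)) ||y - Xw||^2 + (alpha / 2) ||w||^2
# has the closed form  w_MAP = (X^T X + alpha sigma^2 I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + alpha * sigma_e**2 * np.eye(3), X.T @ y)
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_map)   # shrunk toward zero relative to the unregularized MLE solution
print(w_mle)
```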
### Maximum Likelihood Estimation (MLE)

#### Loss Function

$$
\boldsymbol{\theta}_{\text{MLE}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; - \frac{1}{N}\sum_n^N \log p(y_n| f(x_n;\boldsymbol{\theta}))
$$

Proof

This is straightforward to derive because we can pick up from the proof of the MAP loss function above.
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; - \frac{1}{N}\sum_n^N \log p(y_n| f(x_n;\boldsymbol{\theta})) - \log p(\boldsymbol{\theta})
$$

In this case, we assume a uniform prior on our parameters, $\boldsymbol{\theta}$: any parameter value is a priori equally plausible. A uniform distribution has a constant density, so $\log p(\boldsymbol{\theta})$ is a constant that does not affect the $\operatorname{argmin}$, and we can simply drop the prior term from the equation above.
$$
\boldsymbol{\theta}_{\text{MLE}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; - \frac{1}{N}\sum_n^N \log p(y_n| f(x_n;\boldsymbol{\theta}))
$$

### KL-Divergence (Forward)

$$
\text{D}_{\text{KL}}\left[ p_*(x) \,||\, p(x;\theta) \right] = \mathbb{E}_{x\sim p_*}\left[ \log \frac{p_*(x)}{p(x;\theta)}\right]
$$

This is the discrepancy between the best distribution for the data, $p_*(x)$, and the parameterized version, $p(x;\theta)$.
There is an equivalence between the (forward) KL-divergence and maximum likelihood estimation. MLE is framed as maximizing the likelihood of the data under our estimated distribution, whereas the KL-divergence measures the discrepancy between the parameterized distribution and the "true" or "best" distribution of the real data. They are equivalent formulations, but the KL-divergence view makes explicit that MLE is a proxy for fitting the estimated distribution to the "real" data distribution.
Proof
$$
\text{D}_{\text{KL}}\left[ p_*(x) \,||\, p(x;\theta) \right] = \mathbb{E}_{x\sim p_*}\left[ \log \frac{p_*(x)}{p(x;\theta)}\right]
$$

We can expand this term via the log rules:
$$
\mathbb{E}_{x\sim p_*}\left[ \log \frac{p_*(x)}{p(x;\theta)}\right] = \mathbb{E}_{x\sim p_*}\left[ \log p_*(x) - \log p(x;\theta) \right]
$$

The first expectation, $\mathbb{E}_{x\sim p_*}[\log p_*(x)]$, is (up to sign) the entropy term, i.e. the expected uncertainty in the data. This is a constant: no matter how well we estimate the distribution via our parameterized representation, $p(x;\theta)$, this term will not change. So we can ignore it in our loss function.
$$
\mathbb{E}_{x\sim p_*}\left[ \log \frac{p_*(x)}{p(x;\theta)}\right] = -\mathbb{E}_{x\sim p_*}\left[ \log p(x;\theta)\right] + \text{const.}
$$

We can rewrite the remaining expectation in its integral form:
$$
-\mathbb{E}_{x\sim p_*}\left[ \log p(x;\theta)\right] = - \int \log p(x;\theta)\, p_*(x)\, dx
$$

We will approximate the data distribution by the empirical distribution, a mixture of delta functions centered on the observed data points, $p_*(x) \approx \frac{1}{N}\sum_n^N \delta (x - x_n)$. This means that each data point is weighted equally. Plugging that in gives
$$
-\int \log p(x;\theta)\, p_*(x)\, dx \approx - \frac{1}{N}\sum_n^N \int \log p(x;\theta)\, \delta (x - x_n)\, dx
$$

The sifting property of the delta function then collapses each integral onto the corresponding data point.
$$
- \frac{1}{N}\sum_n^N \int \log p(x;\theta)\, \delta (x - x_n)\, dx = - \frac{1}{N}\sum_n^N \log p(x_n;\theta)
$$

So we have:
$$
\text{D}_{\text{KL}}\left[ p_*(x) \,||\, p(x;\theta) \right] = - \frac{1}{N}\sum_n^N \log p(x_n;\theta) = \mathcal{L}_{NLL}(\theta)
$$

which is exactly the negative log-likelihood (NLL) loss.
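As a small numerical illustration (a made-up Gaussian example, not from the text), minimizing the NLL over a family of Gaussians picks out the member closest to the empirical data distribution, exactly as the forward-KL view suggests:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(loc=3.0, scale=2.0, size=1000)

def nll(mu, sigma):
    # average negative log-likelihood of the data under N(mu, sigma^2)
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (data - mu) ** 2 / (2 * sigma**2))

# Scanning over mu, the NLL (and hence the forward KL, up to the constant entropy
# term) is minimized near the sample mean.
mus = np.linspace(0.0, 6.0, 61)
best_mu = mus[np.argmin([nll(m, 2.0) for m in mus])]
print(best_mu, data.mean())   # both close to 3.0
```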
### Laplace Approximation

This is where we approximate the posterior with a Gaussian distribution, $\mathcal{N}(\mu, A^{-1})$:
- $\mathbf{w}=\mathbf{w}_{MAP}$ finds a mode (local maximum) of $p(\mathbf{w}|\mathcal{D})$.
- $A = -\nabla\nabla \log p(\mathcal{D}|\mathbf{w})\, p(\mathbf{w})$, evaluated at $\mathbf{w}_{MAP}$; this Hessian is a very expensive calculation.
- It only captures a single mode and discards the rest of the probability mass.
- It is similar to the KL divergence in one direction.
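Here is a minimal one-dimensional sketch (an assumed Gaussian-mean example, not from the text) of the two steps involved: find the mode by optimization, then use the curvature at the mode as the inverse variance:

```python
import numpy as np
from scipy import optimize

# Hypothetical model: data ~ N(w, 1) with a N(0, 1) prior on w.
rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=50)

def neg_log_posterior(w):
    log_lik = -0.5 * np.sum((data - w) ** 2)   # Gaussian likelihood, sigma = 1
    log_prior = -0.5 * w**2                    # N(0, 1) prior
    return -(log_lik + log_prior)

# 1. Find the mode w_MAP by optimization.
w_map = optimize.minimize_scalar(neg_log_posterior).x

# 2. Curvature at the mode: A = -d^2/dw^2 log p(w|D), here by finite differences.
eps = 1e-4
A = (neg_log_posterior(w_map + eps) - 2 * neg_log_posterior(w_map)
     + neg_log_posterior(w_map - eps)) / eps**2

# 3. The Laplace approximation to the posterior is N(w_map, 1 / A).
print(w_map, 1.0 / A)   # here A = N + 1 = 51, matching the exact Gaussian posterior
```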
### Variational Inference

**Definition**: We find the best approximation to the posterior within a given family of distributions, with respect to the KL-divergence.
$$
\text{KLD}[q||p] = \int_w q(w) \log \frac{q(w)}{p(w|\mathcal{D})}\, dw
$$

Let $q(w)=\mathcal{N}(\mu, S)$ and then minimize $\text{KLD}(q||p)$ to find the parameters $\mu, S$.
“Approximate the posterior, not the model” - James Hensman.
We write out the marginal log-likelihood term for our observations, $y$.
$$
\log p(y;\theta) = \mathbb{E}_{x \sim p(x|y;\theta)}\left[ \log p(y;\theta) \right]
$$

We can expand this term using the product rule, $p(y) = \frac{p(x,y)}{p(x|y)}$.
$$
\log p(y;\theta) = \mathbb{E}_{x \sim p(x|y;\theta)}\left[ \log \underbrace{p(x,y;\theta)}_{joint} - \log \underbrace{p(x|y;\theta)}_{posterior}\right]
$$

where $p(x,y;\theta)$ is the joint distribution and $p(x|y;\theta)$ is the posterior distribution.
We can use a variational distribution, $q(x|y;\phi)$, which will approximate the intractable posterior $p(x|y;\theta)$. This gives us a bound on the marginal log-likelihood:

$$
\log p(y;\theta) \geq \mathcal{L}_{ELBO}(\theta,\phi)
$$

where $\mathcal{L}_{ELBO}$ is the Evidence Lower Bound (ELBO). This serves as a lower bound to the true marginal log-likelihood.
$$
\mathcal{L}_{ELBO}(\theta,\phi) = \mathbb{E}_{q(x|y;\phi)}\left[ \log p(x,y;\theta) - \log q(x|y;\phi) \right]
$$

We can rewrite this to single out the expectations. This results in two important quantities:
$$
\mathcal{L}_{ELBO}(\theta,\phi) = \underbrace{\mathbb{E}_{q(x|y;\phi)}\left[ \log p(y|x;\theta)\right]}_{\text{Reconstruction}} - \underbrace{\text{D}_{\text{KL}}\left[ q(x|y;\phi) \,||\, p(x;\theta)\right]}_{\text{Regularization}}
$$
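Here is a minimal sketch (an assumed conjugate Gaussian model, not from the text) of estimating the ELBO by Monte Carlo with a Gaussian variational posterior $q(x|y;\phi)=\mathcal{N}(\mu, s^2)$; the bound is largest when $q$ matches the exact posterior:

```python
import numpy as np

# Hypothetical model: latent mean x ~ N(0, 1), observations y_i ~ N(x, 1).
rng = np.random.default_rng(2)
y = rng.normal(loc=1.5, scale=1.0, size=30)

def log_joint(x):
    # log p(x, y) = log p(y|x) + log p(x), dropping additive constants
    return -0.5 * np.sum((y - x) ** 2) - 0.5 * x**2

def elbo(mu, s, n_samples=2000):
    # Monte Carlo estimate of E_q[log p(x, y) - log q(x)]
    xs = mu + s * rng.normal(size=n_samples)
    log_q = -0.5 * ((xs - mu) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)
    log_p = np.array([log_joint(x) for x in xs])
    return np.mean(log_p - log_q)

# The exact posterior here is N(sum(y) / (N + 1), 1 / (N + 1)).
N = len(y)
print(elbo(np.sum(y) / (N + 1), np.sqrt(1.0 / (N + 1))))   # near-optimal q, highest ELBO
print(elbo(0.0, 1.0))                                       # poor q, much lower ELBO
```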
## Sampling Methods

### Monte Carlo

We can produce samples from the exact posterior by constructing a specific Markov chain and running Monte Carlo estimation.

We actually do this in practice with neural networks because of their stochastic training regimes: we can modify the SGD algorithm to define a scalable MCMC sampler, sketched below.
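A rough sketch (an assumed linear-Gaussian model, not from the text) of one such sampler, stochastic gradient Langevin dynamics (SGLD): take SGD steps on the log posterior using minibatch gradients and inject Gaussian noise scaled to the step size, so the iterates become approximate posterior samples.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
w_true = np.array([1.0, -1.0])
sigma2 = 0.25                                   # known noise variance
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=200)

def grad_log_post(w, xb, yb, n_total):
    # minibatch estimate of grad log p(w|D) for a linear-Gaussian likelihood, N(0, 1) prior
    grad_lik = (n_total / len(yb)) * xb.T @ (yb - xb @ w) / sigma2
    grad_prior = -w
    return grad_lik + grad_prior

w = np.zeros(2)
step, n_steps, batch = 1e-4, 5000, 32
samples = []
for _ in range(n_steps):
    idx = rng.choice(len(y), size=batch, replace=False)
    grad = grad_log_post(w, X[idx], y[idx], n_total=len(y))
    w = w + 0.5 * step * grad + np.sqrt(step) * rng.normal(size=2)   # SGLD update
    samples.append(w.copy())

print(np.mean(samples[1000:], axis=0))   # approximate posterior mean, close to w_true
```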
Here is a visual demonstration of some popular MCMC samplers.
### Markov Chain Monte Carlo

### Hamiltonian Monte Carlo

### Stochastic Gradient Langevin Dynamics