## Quick Summary

Inference refers to how we find the posterior distribution.
In the previous section, we’ve already seen how we end up with a posterior distribution of the form
$$
p(\boldsymbol{\theta}|\mathcal{D},\mathcal{M}) = \frac{1}{Z}\, p(\mathcal{D}|\boldsymbol{\theta},\mathcal{M})\, p(\boldsymbol{\theta})
$$

where $Z = \int p(\mathcal{D}|\boldsymbol{\theta},\mathcal{M})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}$ is the marginal likelihood.

The problematic term in this posterior distribution is the marginal likelihood, $Z$, because it involves solving an integral.
In general, integration is a very hard problem, especially in high-dimensional settings.
The entire literature on inference revolves around how to deal with this integral. Broadly, there are three families of approaches:

1. **Conjugate Methods**. These enable us to solve the integral exactly. Typically, when things are Gaussian and linear, we can use linear algebra to obtain a closed-form expression for the marginal likelihood.
2. **Local Methods**. These are local approximations whereby we look for a single mode of the (potentially) multi-modal posterior distribution. They convert the integration problem into an optimization problem. This family of methods includes MAP estimation, the Laplace approximation, variational inference, and expectation propagation.
3. **Sampling**. These methods try to sample from the posterior exactly. If the problem is low-dimensional, we can use numerical integration directly, e.g. the trapezoidal rule, quadrature, or Bayesian quadrature (see the short sketch after this list). However, the problem often is not, so there are smarter sampling methods such as Monte Carlo, Metropolis-Hastings, Gibbs sampling, Markov chain Monte Carlo (MCMC), Hamiltonian Monte Carlo (HMC), and particle filters.
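To make the low-dimensional case concrete, here is a minimal sketch (a made-up one-dimensional Gaussian model, not from the text) of computing $Z$ by direct numerical integration with the trapezoidal rule:

```python
import numpy as np

# Hypothetical 1D example: y_i ~ N(theta, 1) with a N(0, 1) prior on theta.
# Because theta is one-dimensional, we can compute Z = \int p(D|theta) p(theta) dtheta
# on a grid with the trapezoidal rule.
data = np.array([0.2, 0.5, -0.1, 0.4])

theta = np.linspace(-5.0, 5.0, 2001)
log_prior = -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)
log_lik = np.array([np.sum(-0.5 * (data - t) ** 2 - 0.5 * np.log(2 * np.pi))
                    for t in theta])

integrand = np.exp(log_lik + log_prior)
Z = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(theta))  # trapezoidal rule

posterior = integrand / Z   # normalized posterior on the grid
print(Z, np.sum(0.5 * (posterior[1:] + posterior[:-1]) * np.diff(theta)))  # Z, then ~1.0
```

This brute-force approach becomes infeasible as soon as the parameter space has more than a handful of dimensions, which is why the methods below exist.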
## Local Methods

### Mean Squared Error (MSE)

In the case of regression, we can use the MSE as a loss function. Under a Gaussian noise assumption, minimizing the MSE is exactly equivalent to minimizing the negative log-likelihood, as shown below.
Proof
The likelihood of our model is:
$$
\log p(y|\mathbf{X},\mathbf{w}) = \sum_{i=1}^N \log p(y_i|\mathbf{x}_i, \boldsymbol{\theta})
$$

For simplicity, we assume the noise $\varepsilon$ comes from a Gaussian distribution with constant variance, so we can rewrite our likelihood as
$$
\log p(y|\mathbf{X},\mathbf{w}) = \sum_{i=1}^N \log \mathcal{N}(y_i \,|\, \mathbf{x}_i\mathbf{w}, \sigma_e^2)
$$

Plugging in the full formula for the Gaussian distribution and simplifying gives us:
$$
\log p(y|\mathbf{X},\mathbf{w}) = \sum_{i=1}^N \log \frac{1}{\sqrt{2 \pi \sigma_e^2}} \exp\left( - \frac{(y_i - \mathbf{x}_i\mathbf{w})^2}{2\sigma_e^2} \right)
$$

We can use the log rule $\log ab = \log a + \log b$ to separate the constant term from the exponential. Also, $\log e^x = x$.
$$
\log p(y|\mathbf{X},\mathbf{w}) = - \frac{N}{2} \log 2 \pi \sigma_e^2 - \sum_{i=1}^N \frac{(y_i - \mathbf{x}_i\mathbf{w})^2}{2\sigma_e^2}
$$

The first term is constant, so we can ignore it in our loss function. We can do the same for the constant denominator $2\sigma_e^2$ in the second term. Up to constants that do not affect the optimum, this simplifies to
$$
\log p(y|\mathbf{X},\mathbf{w}) = - \sum_{i=1}^N (y_i - \mathbf{x}_i\mathbf{w})^2
$$

We want to maximize this quantity: in other words, we want to find the parameters $\mathbf{w}$ for which this expression is largest.
$$
\mathbf{w}_{MLE} = \operatorname*{argmax}_{\mathbf{w}}\; - \sum_{i=1}^N (y_i - \mathbf{x}_i\mathbf{w})^2
$$

We can rewrite this as a minimization, because maximizing a negative quantity is the same as minimizing the corresponding positive quantity.
$$
\mathbf{w}_{MLE} = \operatorname*{argmin}_{\mathbf{w}} \frac{1}{N} \sum_{i=1}^N (y_i - \mathbf{x}_i\mathbf{w})^2
$$

This is the same as the MSE expression, with the addition of the scalar factor $1/N$ (which does not change the minimizer).
$$
\begin{aligned}
\mathbf{w}_{MLE} &= \operatorname*{argmin}_{\mathbf{w}} \frac{1}{N} \sum_{i=1}^N (y_i - \mathbf{x}_i\mathbf{w})^2 \\
&= \operatorname*{argmin}_{\mathbf{w}} \text{MSE}
\end{aligned}
$$

**Note**: If we did not know $\sigma_e^2$, then we would have to optimize over it as well.
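As a quick numerical check, here is a minimal sketch (synthetic data, not from the text) showing that the least-squares / MSE solution coincides with the maximum-likelihood solution for a linear model with Gaussian noise:

```python
import numpy as np

# Synthetic linear-Gaussian data: y = Xw + noise
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
sigma_e = 0.1
y = X @ w_true + sigma_e * rng.normal(size=N)

# Least-squares solution, i.e. the minimizer of the MSE
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

def mse(w):
    return np.mean((y - X @ w) ** 2)

def neg_log_likelihood(w, sigma=sigma_e):
    # Gaussian negative log-likelihood (constant term included for completeness)
    return 0.5 * N * np.log(2 * np.pi * sigma**2) + np.sum((y - X @ w) ** 2) / (2 * sigma**2)

# Both objectives are minimized by the same weights
print(w_mle)                                   # close to w_true
print(mse(w_mle), neg_log_likelihood(w_mle))   # both minimized at the same w
```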
### Maximum A Posteriori (MAP)

#### Loss Function

$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; - \frac{1}{N}\sum_n^N \log p\left(y_n|f(x_n; \boldsymbol{\theta})\right) - \log p(\boldsymbol{\theta})
$$

Proof
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \log p(\boldsymbol{\theta}|\mathcal{D})
$$

We can plug in the base Bayesian formulation:
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \log \left[ \frac{p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{p(\mathcal{D})} \right]
$$

We can expand this term using the log rules:
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \left[ \log p(\mathcal{D}|\boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) - \log p(\mathcal{D}) \right]
$$

Notice that $\log p(\mathcal{D})$ is a constant, since the distribution of the data does not change and it does not depend on the parameters $\boldsymbol{\theta}$. So we can drop that term.
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \left[ \log p(\mathcal{D}|\boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) \right]
$$

We can turn this maximization into a minimization by negating the objective:
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; - \log p(\mathcal{D}|\boldsymbol{\theta}) - \log p(\boldsymbol{\theta})
$$

We cannot evaluate the likelihood term over the full data distribution, regardless of what it is conditioned on, so we express it as an expectation over the data.
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; - \mathbb{E}_{\mathbf{x}\sim P_X} \left[ \log p(\mathcal{D}|\boldsymbol{\theta})\right] - \log p(\boldsymbol{\theta})
$$

We can approximate this expectation using Monte Carlo samples:
$$
\mathbb{E}_{\mathbf{x}}\left[\log p(\mathcal{D}|\boldsymbol{\theta})\right] \approx \frac{1}{N}\sum_n^N \log p(y_n | f(x_n; \boldsymbol{\theta}))
$$

and we assume that, with enough samples, this captures the essence of our data.
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; - \frac{1}{N}\sum_n^N \log p(y_n| f(x_n;\boldsymbol{\theta})) - \log p(\boldsymbol{\theta})
$$
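To make the MAP objective concrete, here is a minimal sketch (an assumed linear-Gaussian model, not from the text): with a Gaussian likelihood and a zero-mean Gaussian prior on the weights, the objective above reduces to L2-regularized (ridge) least squares:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, 0.0, -1.0])
sigma_e, alpha = 0.5, 1.0       # noise std and prior precision, p(w) = N(0, (1/alpha) I)
y = X @ w_true + sigma_e * rng.normal(size=50)

# argmin_w  (1 / (2 sigma^2)) ||y - Xw||^2 + (alpha / 2) ||w||^2
# has the closed form  w_MAP = (X^T X + alpha sigma^2 I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + alpha * sigma_e**2 * np.eye(3), X.T @ y)
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_map)   # shrunk toward zero relative to the unregularized MLE solution
print(w_mle)
```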
### Maximum Likelihood Estimation (MLE)

#### Loss Function

$$
\boldsymbol{\theta}_{\text{MLE}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; - \frac{1}{N}\sum_n^N \log p(y_n| f(x_n;\boldsymbol{\theta}))
$$

Proof

This is straightforward to derive because we can pick up from the proof of the MAP loss function above.
$$
\boldsymbol{\theta}_{\text{MAP}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; - \frac{1}{N}\sum_n^N \log p(y_n| f(x_n;\boldsymbol{\theta})) - \log p(\boldsymbol{\theta})
$$

In this case, we assume a uniform prior on our parameters, $\boldsymbol{\theta}$: any parameter value is a priori equally plausible. A uniform distribution has a constant density, so $\log p(\boldsymbol{\theta})$ is a constant that does not affect the $\operatorname{argmin}$, and we can simply drop the prior term from the equation above.
$$
\boldsymbol{\theta}_{\text{MLE}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; - \frac{1}{N}\sum_n^N \log p(y_n| f(x_n;\boldsymbol{\theta}))
$$

### KL-Divergence (Forward)

$$
\text{D}_{\text{KL}}\left[ p_*(x) \,||\, p(x;\theta) \right] = \mathbb{E}_{x\sim p_*}\left[ \log \frac{p_*(x)}{p(x;\theta)}\right]
$$

This is the discrepancy between the best distribution for the data, $p_*(x)$, and the parameterized version, $p(x;\theta)$.
There is an equivalence between the (forward) KL-divergence and maximum likelihood estimation. MLE is framed as maximizing the likelihood of the data under our estimated distribution, whereas the KL-divergence measures the discrepancy between the parameterized distribution and the "true" or "best" distribution of the real data. They are equivalent formulations, but the KL-divergence view makes explicit that MLE is a proxy for fitting the estimated distribution to the "real" data distribution.
Proof
$$
\text{D}_{\text{KL}}\left[ p_*(x) \,||\, p(x;\theta) \right] = \mathbb{E}_{x\sim p_*}\left[ \log \frac{p_*(x)}{p(x;\theta)}\right]
$$

We can expand this term via the log rules:
$$
\mathbb{E}_{x\sim p_*}\left[ \log \frac{p_*(x)}{p(x;\theta)}\right] = \mathbb{E}_{x\sim p_*}\left[ \log p_*(x) - \log p(x;\theta) \right]
$$

The first expectation, $\mathbb{E}_{x\sim p_*}[\log p_*(x)]$, is (up to sign) the entropy term, i.e. the expected uncertainty in the data. This is a constant: no matter how well we estimate the distribution via our parameterized representation, $p(x;\theta)$, this term will not change. So we can ignore it in our loss function.
$$
\mathbb{E}_{x\sim p_*}\left[ \log \frac{p_*(x)}{p(x;\theta)}\right] = -\mathbb{E}_{x\sim p_*}\left[ \log p(x;\theta)\right] + \text{const.}
$$

We can rewrite the remaining expectation in its integral form:
$$
-\mathbb{E}_{x\sim p_*}\left[ \log p(x;\theta)\right] = - \int \log p(x;\theta)\, p_*(x)\, dx
$$

We will approximate the data distribution by the empirical distribution, a mixture of delta functions centered on the observed data points, $p_*(x) \approx \frac{1}{N}\sum_n^N \delta (x - x_n)$. This means that each data point is weighted equally. Plugging that in gives
$$
-\int \log p(x;\theta)\, p_*(x)\, dx \approx - \frac{1}{N}\sum_n^N \int \log p(x;\theta)\, \delta (x - x_n)\, dx
$$

The sifting property of the delta function then collapses each integral onto the corresponding data point.
$$
- \frac{1}{N}\sum_n^N \int \log p(x;\theta)\, \delta (x - x_n)\, dx = - \frac{1}{N}\sum_n^N \log p(x_n;\theta)
$$

So we have:
$$
\text{D}_{\text{KL}}\left[ p_*(x) \,||\, p(x;\theta) \right] = - \frac{1}{N}\sum_n^N \log p(x_n;\theta) = \mathcal{L}_{NLL}(\theta)
$$

which is exactly the negative log-likelihood (NLL) loss.
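As a small numerical illustration (a made-up Gaussian example, not from the text), minimizing the NLL over a family of Gaussians picks out the member closest to the empirical data distribution, exactly as the forward-KL view suggests:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(loc=3.0, scale=2.0, size=1000)

def nll(mu, sigma):
    # average negative log-likelihood of the data under N(mu, sigma^2)
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (data - mu) ** 2 / (2 * sigma**2))

# Scanning over mu, the NLL (and hence the forward KL, up to the constant entropy
# term) is minimized near the sample mean.
mus = np.linspace(0.0, 6.0, 61)
best_mu = mus[np.argmin([nll(m, 2.0) for m in mus])]
print(best_mu, data.mean())   # both close to 3.0
```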
### Laplace Approximation

This is where we approximate the posterior with a Gaussian distribution, $\mathcal{N}(\mu, A^{-1})$:
- $\mathbf{w}=\mathbf{w}_{MAP}$ finds a mode (local maximum) of $p(\mathbf{w}|\mathcal{D})$.
- $A = -\nabla\nabla \log p(\mathcal{D}|\mathbf{w})\, p(\mathbf{w})$, evaluated at $\mathbf{w}_{MAP}$; this Hessian is a very expensive calculation.
- It only captures a single mode and discards the rest of the probability mass.
- It is similar to the KL divergence in one direction.
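Here is a minimal one-dimensional sketch (an assumed Gaussian-mean example, not from the text) of the two steps involved: find the mode by optimization, then use the curvature at the mode as the inverse variance:

```python
import numpy as np
from scipy import optimize

# Hypothetical model: data ~ N(w, 1) with a N(0, 1) prior on w.
rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=50)

def neg_log_posterior(w):
    log_lik = -0.5 * np.sum((data - w) ** 2)   # Gaussian likelihood, sigma = 1
    log_prior = -0.5 * w**2                    # N(0, 1) prior
    return -(log_lik + log_prior)

# 1. Find the mode w_MAP by optimization.
w_map = optimize.minimize_scalar(neg_log_posterior).x

# 2. Curvature at the mode: A = -d^2/dw^2 log p(w|D), here by finite differences.
eps = 1e-4
A = (neg_log_posterior(w_map + eps) - 2 * neg_log_posterior(w_map)
     + neg_log_posterior(w_map - eps)) / eps**2

# 3. The Laplace approximation to the posterior is N(w_map, 1 / A).
print(w_map, 1.0 / A)   # here A = N + 1 = 51, matching the exact Gaussian posterior
```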
### Variational Inference

**Definition**: We find the best approximation to the posterior within a given family of distributions, with respect to the KL-divergence.
$$
\text{KLD}[q||p] = \int_w q(w) \log \frac{q(w)}{p(w|\mathcal{D})}\, dw
$$

Let $q(w)=\mathcal{N}(\mu, S)$ and then minimize $\text{KLD}(q||p)$ to find the parameters $\mu, S$.
“Approximate the posterior, not the model” - James Hensman.
We write out the marginal log-likelihood term for our observations, $y$.
$$
\log p(y;\theta) = \mathbb{E}_{x \sim p(x|y;\theta)}\left[ \log p(y;\theta) \right]
$$

We can expand this term using the product rule, $p(y) = \frac{p(x,y)}{p(x|y)}$.
$$
\log p(y;\theta) = \mathbb{E}_{x \sim p(x|y;\theta)}\left[ \log \underbrace{p(x,y;\theta)}_{joint} - \log \underbrace{p(x|y;\theta)}_{posterior}\right]
$$

where $p(x,y;\theta)$ is the joint distribution and $p(x|y;\theta)$ is the posterior distribution.
We can use a variational distribution, $q(x|y;\phi)$, which will approximate the intractable posterior $p(x|y;\theta)$. This gives us a bound on the marginal log-likelihood:

$$
\log p(y;\theta) \geq \mathcal{L}_{ELBO}(\theta,\phi)
$$

where $\mathcal{L}_{ELBO}$ is the Evidence Lower Bound (ELBO). This serves as a lower bound to the true marginal log-likelihood.
$$
\mathcal{L}_{ELBO}(\theta,\phi) = \mathbb{E}_{q(x|y;\phi)}\left[ \log p(x,y;\theta) - \log q(x|y;\phi) \right]
$$

We can rewrite this to single out the expectations. This results in two important quantities:
$$
\mathcal{L}_{ELBO}(\theta,\phi) = \underbrace{\mathbb{E}_{q(x|y;\phi)}\left[ \log p(y|x;\theta)\right]}_{\text{Reconstruction}} - \underbrace{\text{D}_{\text{KL}}\left[ q(x|y;\phi) \,||\, p(x;\theta)\right]}_{\text{Regularization}}
$$
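Here is a minimal sketch (an assumed conjugate Gaussian model, not from the text) of estimating the ELBO by Monte Carlo with a Gaussian variational posterior $q(x|y;\phi)=\mathcal{N}(\mu, s^2)$; the bound is largest when $q$ matches the exact posterior:

```python
import numpy as np

# Hypothetical model: latent mean x ~ N(0, 1), observations y_i ~ N(x, 1).
rng = np.random.default_rng(2)
y = rng.normal(loc=1.5, scale=1.0, size=30)

def log_joint(x):
    # log p(x, y) = log p(y|x) + log p(x), dropping additive constants
    return -0.5 * np.sum((y - x) ** 2) - 0.5 * x**2

def elbo(mu, s, n_samples=2000):
    # Monte Carlo estimate of E_q[log p(x, y) - log q(x)]
    xs = mu + s * rng.normal(size=n_samples)
    log_q = -0.5 * ((xs - mu) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)
    log_p = np.array([log_joint(x) for x in xs])
    return np.mean(log_p - log_q)

# The exact posterior here is N(sum(y) / (N + 1), 1 / (N + 1)).
N = len(y)
print(elbo(np.sum(y) / (N + 1), np.sqrt(1.0 / (N + 1))))   # near-optimal q, highest ELBO
print(elbo(0.0, 1.0))                                       # poor q, much lower ELBO
```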
## Sampling Methods

### Monte Carlo

We can produce samples from the exact posterior by constructing a specific Markov chain and running Monte Carlo estimation.

We actually do this in practice with neural networks because of their stochastic training regimes: we can modify the SGD algorithm to define a scalable MCMC sampler, sketched below.
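A rough sketch (an assumed linear-Gaussian model, not from the text) of one such sampler, stochastic gradient Langevin dynamics (SGLD): take SGD steps on the log posterior using minibatch gradients and inject Gaussian noise scaled to the step size, so the iterates become approximate posterior samples.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
w_true = np.array([1.0, -1.0])
sigma2 = 0.25                                   # known noise variance
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=200)

def grad_log_post(w, xb, yb, n_total):
    # minibatch estimate of grad log p(w|D) for a linear-Gaussian likelihood, N(0, 1) prior
    grad_lik = (n_total / len(yb)) * xb.T @ (yb - xb @ w) / sigma2
    grad_prior = -w
    return grad_lik + grad_prior

w = np.zeros(2)
step, n_steps, batch = 1e-4, 5000, 32
samples = []
for _ in range(n_steps):
    idx = rng.choice(len(y), size=batch, replace=False)
    grad = grad_log_post(w, X[idx], y[idx], n_total=len(y))
    w = w + 0.5 * step * grad + np.sqrt(step) * rng.normal(size=2)   # SGLD update
    samples.append(w.copy())

print(np.mean(samples[1000:], axis=0))   # approximate posterior mean, close to w_true
```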
Here is a visual demonstration of some popular MCMC samplers.
### Markov Chain Monte Carlo

### Hamiltonian Monte Carlo

### Stochastic Gradient Langevin Dynamics