Inference Schemes#
Source | Deisenroth - Sampling
Advances in VI - Notebook
Numerical Integration (low dimension)
Bayesian Quadrature
Expectation Propagation
Conjugate Priors (Gaussian Likelihood w/ GP Prior)
Subset Methods (Nystrom)
Fast Linear Algebra (Krylov, Fast Transforms, KD-Trees)
Variational Methods (Laplace, Mean-Field, Expectation Propagation)
Monte Carlo Methods (Gibbs, Metropolis-Hastings, Particle Filter)
Local Methods
Sampling Methods
Local Methods#
Mean Squared Error (MSE)#
In the case of regression, we can use the MSE as a loss function. Minimizing it is exactly equivalent to minimizing the negative log-likelihood term above (under a Gaussian noise assumption).
Proof
The likelihood of our model is:
And for simplicity, we assume the noise \(\epsilon\) comes from a Gaussian distribution and that its variance is constant (homoscedastic). So we can rewrite our likelihood as
Plugging in the full formula for the Gaussian distribution with some simplifications gives us:
We can use the log rule \(\log ab = \log a + \log b\) to rewrite this expression to separate the constant term from the exponential. Also, \(\log e^x = x\).
The first term is constant, so we can ignore it in our loss function. We can do the same for the constant denominator of the second term. Let’s simplify it to make our life easier.

So we want to maximize this quantity: in other words, we want to find the parameters \(\mathbf{w}\) such that this expression is at its maximum.

We can rewrite this expression because maximizing a negative quantity is the same as minimizing the corresponding positive quantity.

This is the same as the MSE expression, up to the scalar factor \(1/N\).
Note: If we did not know \(\sigma_y^2\) then we would have to optimize this as well.
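Putting the steps above together, here is a compact sketch of the derivation (assuming, as notation, a model prediction \(f(\mathbf{x};\mathbf{w})\) and i.i.d. observations \(\{(\mathbf{x}_i, y_i)\}_{i=1}^N\)):

$$
\begin{aligned}
p(y|\mathbf{x},\mathbf{w}) &= \mathcal{N}\left(y \,|\, f(\mathbf{x};\mathbf{w}), \sigma_y^2\right) \\
-\log p(\mathbf{y}|\mathbf{X},\mathbf{w}) &= \frac{N}{2}\log(2\pi\sigma_y^2) + \frac{1}{2\sigma_y^2}\sum_{i=1}^N \left(y_i - f(\mathbf{x}_i;\mathbf{w})\right)^2 \\
\mathbf{w}^* &= \operatorname*{argmin}_{\mathbf{w}}\; \frac{1}{N}\sum_{i=1}^N \left(y_i - f(\mathbf{x}_i;\mathbf{w})\right)^2
\end{aligned}
$$

The constant term and the \(1/(2\sigma_y^2)\) scale do not change the location of the minimum, which is why the MSE and the negative log-likelihood share the same optimal \(\mathbf{w}\).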
Sources:
Maximum A Priori (MAP)#
Loss Function#
Proof
We can plug in the base Bayesian formulation
We can expand this term using the log rules
Notice that \(\log p(D)\) is a constant as the distribution of the data won’t change. It also does not depend on the parameters, \(\theta\). So we can cancel that term out.
We will change this maximization problem into a minimization problem.
We cannot find the probability distribution \(p(D|\theta)\) exactly, regardless of what it is conditioned on. So we need to take some sort of expectation over the entire data distribution.

We can approximate this expectation using Monte Carlo samples. This is given by:
and we assume that with enough samples, we will capture the essence of our data.
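A compact sketch of the steps above (using, as notation, individual data points \(x_i\) and a dataset \(D = \{x_i\}_{i=1}^N\)):

$$
\begin{aligned}
\theta_{MAP} &= \operatorname*{argmax}_{\theta}\; \log p(D|\theta) + \log p(\theta) \\
&= \operatorname*{argmin}_{\theta}\; -\log p(D|\theta) - \log p(\theta) \\
&\approx \operatorname*{argmin}_{\theta}\; -\frac{1}{N}\sum_{i=1}^{N} \log p(x_i|\theta) - \log p(\theta)
\end{aligned}
$$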
Maximum Likelihood Estimation (MLE)#
Loss Function#
Proof
This is straightforward to derive because we can pick up from the proof of the MAP loss function, eq. (22).
In this case, we will assume a uniform prior on our parameters, \(\theta\). This means that any parameter value is considered equally plausible a priori. The uniform distribution has a constant density, so \(\log p(\theta)\) is a constant that does not depend on \(\theta\). As a result, we can simply remove the log prior on our parameters from the above equation.
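With the log prior removed, the MAP objective above reduces to the maximum likelihood objective (a sketch, same notation as before):

$$
\theta_{MLE} \approx \operatorname*{argmin}_{\theta}\; -\frac{1}{N}\sum_{i=1}^{N} \log p(x_i|\theta)
$$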
Intuitively, there are many possible solutions that would minimize this equation equally well, or, even worse, many local minima that we could get stuck in when trying to optimize it.
KL-Divergence (Forward)#
This is the discrepancy between the best distribution for the data, \(p_*(x)\), and the parameterized version, \(p(x;\theta)\).

There is an equivalence between the (forward) KL-divergence and maximum likelihood estimation. MLE is framed as maximizing the likelihood of the data under our estimated distribution, whereas the KL-divergence is a discrepancy measure between the parameterized distribution and the “true” or “best” distribution of the real data. They are equivalent formulations, but the KL-divergence view shows that MLE is a proxy for fitting the estimated distribution to the “real” data distribution.
Proof
We can expand this term via logs
The first expectation, \(\mathbb{E}_{x\sim p_*}[\log p_*(x)]\), is the (negative) entropy term, i.e. the expected uncertainty in the data. This is a constant term because, no matter how well we estimate this distribution via our parameterized representation, \(p(x;\theta)\), this term will not change. So we can ignore it in our loss function.
We can rewrite this in its integral form:
We will assume that the data distribution is the empirical distribution, i.e. a mixture of delta functions, \(p_*(x) = \frac{1}{N}\sum_{i=1}^N \delta (x - x_i)\). This means that each data point is weighted equally. If we plug that into our model, we see that it is

We will do the same approximation of the integral with samples from this empirical distribution.
So we have:
which is exactly the expression for the NLL loss.
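As a quick numerical check of this equivalence, here is a minimal sketch (assuming, purely for illustration, a 1-D Gaussian “true” distribution and a 1-D Gaussian model): the Monte Carlo estimate of the forward KL and the NLL differ only by the constant entropy term.

```python
import numpy as np
from scipy.stats import norm

# "True" data distribution p_*(x) and a parameterized model p(x; theta)
p_star = norm(loc=0.0, scale=1.0)   # hypothetical ground truth
p_theta = norm(loc=0.5, scale=1.5)  # hypothetical model guess

# Samples from the data distribution (our dataset)
x = p_star.rvs(size=100_000, random_state=42)

# Monte Carlo estimate of the forward KL: E_{p_*}[log p_*(x) - log p(x; theta)]
kl_mc = np.mean(p_star.logpdf(x) - p_theta.logpdf(x))

# Negative log-likelihood of the data under the model
nll = -np.mean(p_theta.logpdf(x))

# The difference is the (constant) entropy of p_*, which does not depend on theta
entropy = -np.mean(p_star.logpdf(x))
print(kl_mc, nll - entropy)  # these two numbers match
```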
Laplace Approximation#
This is where we approximate the posterior with a Gaussian distribution \(\mathcal{N}(\mu, A^{-1})\).
\(\mu = w_{MAP}\), which is a mode (local maximum) of \(p(w|D)\)

\(A = -\nabla\nabla \log p(D|w)p(w)\big|_{w=w_{MAP}}\) - a very expensive calculation

Only captures a single mode and discards the rest of the probability mass

similar in spirit to the KLD in one direction (the mode-seeking direction).
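A minimal sketch of this procedure on a toy 1-D model (the model, prior, and data below are made up purely for illustration): find \(w_{MAP}\) by optimization, then take \(A\) as the negative second derivative of the log joint at that point.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy data and model: y_i ~ Poisson(exp(w * x_i)) with a standard Gaussian prior on w
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1, 2, 2, 5, 8])

def neg_log_joint(w):
    # -log p(D|w) - log p(w), dropping w-independent constants
    rate = np.exp(w * x)
    log_lik = np.sum(y * (w * x) - rate)
    log_prior = -0.5 * w**2
    return -(log_lik + log_prior)

# 1) w_MAP: a mode of p(w|D)
res = minimize_scalar(neg_log_joint)
w_map = res.x

# 2) A = -d^2/dw^2 log p(D|w)p(w) at w_MAP, via a finite-difference second derivative
eps = 1e-4
A = (neg_log_joint(w_map + eps) - 2 * neg_log_joint(w_map) + neg_log_joint(w_map - eps)) / eps**2

# Laplace approximation: p(w|D) is approximated by N(w_MAP, A^{-1})
print(f"q(w) = N({w_map:.3f}, {1.0 / A:.4f})")
```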
Sources
Modern Arts of Laplace Approximation - Agustinus - Blog
Variational Inference#
Definition: We can find the best approximation within a given family w.r.t. the KL-Divergence.

$$
\text{KLD}[q||p] = \int_w q(w) \log \frac{q(w)}{p(w|D)}dw
$$

Let \(q(w)=\mathcal{N}(\mu, S)\) and then we minimize \(\text{KLD}(q||p)\) to find the parameters \(\mu, S\).
“Approximate the posterior, not the model” - James Hensman.
We write out the marginal log-likelihood term for our observations, \(y\).
We can expand this term using Bayes rule: \(p(y) = \frac{p(x,y)}{p(x|y)}\).
where \(p(x,y;\theta)\) is the joint distribution function and \(p(x|y;\theta)\) is the posterior distribution function.
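In equation form (a sketch consistent with the identity above, valid for any \(x\) with \(p(x|y;\theta) > 0\)):

$$
\log p(y;\theta) = \log p(x, y;\theta) - \log p(x|y;\theta)
$$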
We can use a variational distribution, \(q(x|y;\phi)\), which will approximate the true posterior, \(p(x|y;\theta)\).

where \(\mathcal{L}_{ELBO}\) is the Evidence Lower Bound (ELBO) term. This serves as a lower bound to the true marginal log-likelihood.

We can rewrite this to single out the expectations. This will result in two important quantities: an expected log-likelihood (reconstruction) term and a KL-divergence (regularization) term.
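A sketch of that decomposition, assuming the joint factorizes as \(p(x,y;\theta)=p(y|x;\theta)\,p(x;\theta)\):

$$
\begin{aligned}
\log p(y;\theta) &= \mathbb{E}_{q(x|y;\phi)}\left[\log \frac{p(x,y;\theta)}{q(x|y;\phi)}\right] + \text{KLD}\big[q(x|y;\phi)\,||\,p(x|y;\theta)\big] \\
&\geq \mathcal{L}_{ELBO} = \underbrace{\mathbb{E}_{q(x|y;\phi)}\big[\log p(y|x;\theta)\big]}_{\text{reconstruction}} - \underbrace{\text{KLD}\big[q(x|y;\phi)\,||\,p(x;\theta)\big]}_{\text{regularization}}
\end{aligned}
$$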
Sampling Methods#
Monte Carlo#
We can produce samples from the exact posterior by defining a specific Markov chain whose stationary distribution is the posterior.

We actually do this in practice with NNs because of their stochastic training regimes: we can modify the SGD algorithm to define a scalable MCMC sampler, e.g. stochastic gradient Langevin dynamics (SGLD); see the sketch below.
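A minimal sketch of one such sampler, SGLD, on a toy problem (inferring the mean of a Gaussian; the data, step size, and batch size below are made up for illustration): each step is an SGD update on a minibatch estimate of the log posterior gradient, plus Gaussian noise scaled to the step size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y_i ~ N(theta_true, 1), with a N(0, 10) prior on theta
theta_true = 2.0
y = rng.normal(theta_true, 1.0, size=1000)
N = len(y)

def grad_log_posterior_minibatch(theta, batch):
    # Unbiased estimate of grad log p(theta | y):
    # grad log prior + (N / batch size) * sum of grad log-likelihoods on the minibatch
    grad_log_prior = -theta / 10.0
    grad_log_lik = (N / len(batch)) * np.sum(batch - theta)
    return grad_log_prior + grad_log_lik

theta = 0.0
samples = []
for t in range(5000):
    batch = rng.choice(y, size=32)
    eps = 1e-4  # small constant step size (a decaying schedule is more common)
    grad = grad_log_posterior_minibatch(theta, batch)
    # SGLD update: gradient step on the log posterior + injected Gaussian noise
    theta = theta + 0.5 * eps * grad + np.sqrt(eps) * rng.normal()
    samples.append(theta)

print(np.mean(samples[1000:]), np.var(samples[1000:]))  # approx. posterior mean and variance
```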
Here is a visual demonstration of some popular MCMC samplers.