
Bayesian Modeling

CSIC
UCM
IGEO

Rules

Product Rule

$$p(x,y) = p(x|y)p(y) = p(y|x)p(x)$$

Sum Rule

$$p(y) = \int_\mathcal{X} p(x,y)\,dx = \int_\mathcal{X} p(y|x)p(x)\,dx$$

Bayes Rule

$$p(x|y) = \frac{1}{Z}p(y|x)p(x)$$

where $Z := p(y) = \int_\mathcal{X} p(y,x)\,dx$
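
As a quick sanity check, the three rules can be verified on a small discrete example, where the integrals become sums. This is only a sketch: the joint table `joint` and the helper `posterior` are made up for illustration.

```python
# Hypothetical joint distribution p(x, y) over x in {0, 1} and y in {0, 1}.
joint = {
    (0, 0): 0.30,
    (0, 1): 0.10,
    (1, 0): 0.20,
    (1, 1): 0.40,
}
xs = (0, 1)
ys = (0, 1)

# Sum rule: marginalize the joint over the other variable.
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# Product rule rearranged: p(y|x) = p(x, y) / p(x), keyed as (y, x).
p_y_given_x = {(y, x): joint[(x, y)] / p_x[x] for x in xs for y in ys}


def posterior(y):
    """Bayes rule: p(x|y) = p(y|x) p(x) / Z, with Z = sum_x p(y|x) p(x)."""
    unnormalized = {x: p_y_given_x[(y, x)] * p_x[x] for x in xs}
    z = sum(unnormalized.values())
    return {x: value / z for x, value in unnormalized.items()}, z


post, z = posterior(1)
print("p(x | y=1):", post)
print("Z:", z, "p(y=1):", p_y[1])
```

The normalizer `z` returned by `posterior` should agree with the marginal `p_y[1]` up to floating point error, which is exactly the statement $Z := p(y)$.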

Predictive Posterior

$$
\begin{aligned}
p(u^*|u) &= \int_\theta p(u^*|\theta)p(\theta|u)\,d\theta \\
&\approx \frac{1}{N}\sum_{n=1}^N p(u^*|\theta_n), && \theta_n \sim p(\theta|u), && n=1,2,\ldots,N
\end{aligned}
$$
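
The Monte Carlo average can be illustrated with a model where the answer is known in closed form. The sketch below assumes a Bernoulli likelihood with a Beta prior, which is my choice for illustration and not something the text prescribes; the function names `predictive_mc` and `predictive_exact` are hypothetical.

```python
import random


def predictive_mc(successes, failures, num_samples=20000, a=1.0, b=1.0, seed=0):
    """Monte Carlo estimate of p(u*=1 | u) under a Beta(a, b) prior (assumed model)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # theta_n ~ p(theta | u), which is Beta(a + successes, b + failures) here.
        theta_n = rng.betavariate(a + successes, b + failures)
        # For a Bernoulli likelihood, p(u*=1 | theta_n) is simply theta_n.
        total += theta_n
    return total / num_samples


def predictive_exact(successes, failures, a=1.0, b=1.0):
    """Closed-form posterior predictive for the same Beta-Bernoulli model."""
    return (a + successes) / (a + b + successes + failures)


estimate = predictive_mc(successes=7, failures=3)
exact = predictive_exact(successes=7, failures=3)
print(estimate, exact)
```

With more samples the estimate should approach the exact value; in models without a conjugate posterior, the samples $\theta_n$ would come from an approximate sampler instead.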

Idea

A model is something that links inputs to outputs. If we are given data, $X \in \mathbb{R}^{N \times D}$, and observations, $y$, we would ideally want to know how these two entities are related. That relationship (or transformation) from the data $X$ to the observations $y$ is what we call a model, $\mathcal{M}$.


More concretely, let $X \in \mathbb{R}^{N \times D}$ and $y \in \mathbb{R}^{N}$, where $N$ is the number of samples and $D$ is the number of dimensions/features. In a transformation sense, we can think of the model as a function $f$ that maps the data $X$ to $y$, i.e. $f:\mathbb{X}\rightarrow \mathbb{Y}$ with $\mathbb{R}^{N \times D}\rightarrow \mathbb{R}^{N}$. Put simply, the following equation describes our model.

$$y = f(X)$$

But suppose we put a statistical spin on it and say that $X$ is a random variable (r.v.), $X \sim \mathbb{P}$. We typically don’t know $\mathbb{P}$, or else there really would not be a problem. Even worse, suppose there is noise in our observations, so we are not sure that each input $x$ corresponds exactly to each output $y$. Fortunately, mathematics offers frameworks that recast a problem into a form we can solve. In this case, we can use probability theory to express the uncertainty and noise that come with our model, $\mathcal{M}$. More specifically, we can use Bayes rule to obtain inverse probabilities, which let us do inference: using our data to infer unknown quantities and model aspects and (most importantly) to make predictions.
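
To make the effect of observation noise concrete, here is a minimal sketch. It assumes additive Gaussian noise, $y = f(x) + \epsilon$, which the text does not specify; the function `f`, the helper `observe`, and the noise level are all hypothetical.

```python
import math
import random


def f(x):
    """Hypothetical 'true' function; in a real problem this is unknown."""
    return math.sin(x)


def observe(x, noise_std, rng):
    """One noisy observation of f(x), assuming additive Gaussian noise."""
    return f(x) + rng.gauss(0.0, noise_std)


rng = random.Random(0)
x = 1.0
# Repeated observations at the same input give different outputs.
samples = [observe(x, 0.1, rng) for _ in range(5)]
print(samples)
```

The same input no longer maps to a single output, which is why a distribution over outputs (and over model parameters) is a more natural description than a single deterministic function.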

Bayes Rule in Words

In a Machine Learning problem, we almost always have the following components:

  • Data
  • Model which we believe can describe our data,
    • parameters which can be changed/tuned to fit the data
  • Goal, for example:
    • learn the parameters given the data
    • determine which points belong to which cluster
    • predict function outputs
    • predict future labels
    • predict the lower dimensional embedding/representation

The Bayesian framework works best when you think about it from a probabilistic standpoint.

$$P(\text{Model}|\text{Data}) = \frac{P(\text{Data}|\text{Model})P(\text{Model})}{P(\text{Data})}$$

I’ve seen some people (here, here) treat Model, $\mathcal{M}$, and Hypothesis, $\mathcal{H}$, as roughly equivalent. In this particular instance, think of $\mathcal{M}$ as the best possible outcome that we can achieve to map $x$ to $y$ correctly, and think of $\mathcal{H}$ as a set of possible formulas we could use, like a universe containing all of the possible formulas and collections of parameters. I quite like the term Hypothesis because it adds another level of abstraction when thinking about the problem. At the same time, this extra layer of abstraction is not something I like to think about all of the time.

Let’s break down each of these components.

  • $P(\text{Model})$ - Prior Probability
  • $P(\text{Data})$ - Evidence, Normalization Constant
  • $P(\text{Model}|\text{Data})$ - Posterior Probability
  • $P(\text{Data}|\text{Model})$ - Likelihood

Let’s change the notation to something a bit more common.

$$P(\theta | \mathcal{D}, \mathcal{M}) = \frac{P(\mathcal{D}|\theta, \mathcal{M})P(\theta | \mathcal{M})}{P(\mathcal{D}|\mathcal{M})}$$

where:

  • $P(\mathcal{D}|\theta, \mathcal{M})$ - Likelihood of the parameters, $\theta$, in model $\mathcal{M}$

    This is a likelihood of the parameters (not of the data). For every set of parameters, I can assign a probability to the observed data.

  • $P(\theta | \mathcal{M})$ - Prior probability of $\theta$

    This expresses the distribution and the uncertainty of the parameters that define my model. It’s a way of constraining the range of values that can occur. Expert knowledge in this area is crucial if you would like physics-aware machine learning models.

  • $P(\mathcal{D}|\mathcal{M})$ - The normalization constant (the marginal likelihood)

    This term seems to give us a lot of problems, but it is an artifact of Bayes rule: in order to obtain the posterior, we need to renormalize.

  • $P(\theta | \mathcal{D}, \mathcal{M})$ - Posterior of $\theta$ given data $\mathcal{D}$

    This is often the objective, i.e. what we are actually interested in knowing. We can think of this as an inverse problem because we have the forward connection from prior to likelihood, but we’re missing the “reverse” direction.

There are a few things that are different. First of all, every single component is conditioned on a model $\mathcal{M}$. This is to say, given that I have described my model, here are the configurations that this model requires. So we’re really staying true to model-based machine learning instead of the toolbox method. Also, I’ve changed the data to be denoted as $\mathcal{D}$, where $\mathcal{D}=\left\{ (x_1, y_1), \ldots, (x_N, y_N) \right\}$.
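
The four components can be computed explicitly for a one-parameter model using a grid over $\theta$. The sketch below assumes a coin-flip (Bernoulli) model with a uniform prior, chosen only because the exact answer is easy to check; `grid_posterior` and the toy data are hypothetical.

```python
import math


def grid_posterior(data, grid_size=1001):
    """Grid approximation of P(theta | D, M) for Bernoulli data, uniform prior."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    step = thetas[1] - thetas[0]
    heads = sum(data)
    tails = len(data) - heads

    # Prior P(theta | M): uniform density on [0, 1].
    prior = [1.0 for _ in thetas]
    # Likelihood P(D | theta, M), viewed as a function of theta.
    likelihood = [t ** heads * (1.0 - t) ** tails for t in thetas]
    unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]

    # Evidence P(D | M): trapezoidal integration of likelihood * prior over theta.
    evidence = step * (sum(unnormalized) - 0.5 * (unnormalized[0] + unnormalized[-1]))
    # Posterior density on the grid: renormalize by the evidence.
    posterior = [u / evidence for u in unnormalized]
    return thetas, posterior, evidence


data = [1, 1, 0, 1, 0, 1, 1, 1]  # toy data: 6 heads, 2 tails
thetas, posterior, evidence = grid_posterior(data)
# For a uniform prior the evidence is a Beta function: heads! * tails! / (n + 1)!
exact_evidence = math.factorial(6) * math.factorial(2) / math.factorial(9)
print(evidence, exact_evidence)
```

A grid only works for a handful of parameters, since its size grows exponentially with the dimension of $\theta$. That is one reason the normalization constant is usually the hard part in practice.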

Bayes Rule

$$p(\boldsymbol{\theta}|\mathcal{D}) = \frac{1}{Z}p(\mathcal{D}|\boldsymbol{\theta})p(\boldsymbol{\theta})$$

where $Z$ is the normalizing coefficient given by the equation:

$$Z = \int_{\boldsymbol{\theta}} p(\mathcal{D}|\boldsymbol{\theta})p(\boldsymbol{\theta})\,d\boldsymbol{\theta}$$

Hierarchical Bayesian Modeling

$$
\begin{aligned}
\text{Data Model}: && p(\mathcal{D}|\boldsymbol{\theta},\boldsymbol{\alpha}) \\
\text{Process Model}: && p(\boldsymbol{\theta}|\boldsymbol{\alpha}) \\
\text{Parameter Model}: && p(\boldsymbol{\alpha})
\end{aligned}
$$
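
One way to read the three levels is as a recipe for generating data from the top down: draw $\boldsymbol{\alpha}$, then $\boldsymbol{\theta}$ given $\boldsymbol{\alpha}$, then $\mathcal{D}$ given both. The sketch below assumes simple Gaussian and exponential distributions purely for illustration; none of these choices, nor the name `sample_hierarchy`, come from the text.

```python
import random


def sample_hierarchy(num_obs, rng):
    """Ancestral sampling: alpha ~ p(alpha), theta ~ p(theta|alpha), D ~ p(D|theta, alpha)."""
    # Parameter model p(alpha): hypothetical exponential priors on two scales.
    alpha = {
        "process_std": rng.expovariate(1.0),
        "noise_std": rng.expovariate(10.0),
    }
    # Process model p(theta | alpha): latent state drawn around zero.
    theta = rng.gauss(0.0, alpha["process_std"])
    # Data model p(D | theta, alpha): noisy observations of the latent state.
    data = [rng.gauss(theta, alpha["noise_std"]) for _ in range(num_obs)]
    return alpha, theta, data


alpha, theta, data = sample_hierarchy(num_obs=5, rng=random.Random(0))
print(alpha, theta, data)
```

Inference runs in the opposite direction: given $\mathcal{D}$, the joint posterior over $\boldsymbol{\theta}$ and $\boldsymbol{\alpha}$ is proportional to the product of the three terms.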