
Bayesian Modeling


Models

Model

\mathbf{y} = f(\mathbf{x}; \boldsymbol{\theta}) + \boldsymbol{\epsilon}

Measurement Model

p(\mathbf{y}|\mathbf{x}; \boldsymbol{\theta}) \sim \mathcal{N}(\mathbf{y}|\boldsymbol{f}(\mathbf{x};\boldsymbol{\theta}), \sigma^2)

Likelihood Loss Function

\log p(\mathbf{y}|\mathbf{x}; \boldsymbol{\theta})

Loss Function

\mathcal{L}(\boldsymbol{\theta}) = - \frac{1}{2\sigma^2}||\mathbf{y} - f(\mathbf{x};\boldsymbol{\theta})||_2^2 - \log p(\mathbf{x};\boldsymbol{\theta})

Error Minimization

We choose an error measure or loss function, \mathcal{L}, to minimize wrt the parameters, \theta.

\mathcal{L}(\theta) = \sum_{i=1}^N \left[ f(x_i;\theta) - y_i \right]^2

We typically add some form of regularization in order to constrain the solution:

\mathcal{L}(\theta) = \sum_{i=1}^N \left[ f(x_i;\theta) - y_i \right]^2 + \lambda \mathcal{R}(\theta)
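Below is a minimal NumPy sketch of this regularized loss. The helper name `regularized_sse_loss`, the toy linear model, and the value of \lambda are illustrative assumptions, and the ridge penalty ||\theta||_2^2 is just one common choice of \mathcal{R}(\theta).

```python
import numpy as np

def regularized_sse_loss(theta, x, y, f, lam=0.1):
    """Sum-of-squares error plus an L2 (ridge) penalty, R(theta) = ||theta||^2."""
    residuals = f(x, theta) - y              # f(x_i; theta) - y_i
    data_term = np.sum(residuals ** 2)       # sum_i [f(x_i; theta) - y_i]^2
    return data_term + lam * np.sum(theta ** 2)

# Toy usage with a linear model f(x; theta) = theta[0] + theta[1] * x
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=x.shape)
f = lambda x, theta: theta[0] + theta[1] * x
print(regularized_sse_loss(np.array([1.0, 2.0]), x, y, f))
```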

Probabilistic Approach

We explicitly account for noise in our model.

y = f(x;\theta) + \epsilon(x)

where \epsilon is the noise term. The simplest noise assumption, and the one we see in many approaches, is iid Gaussian noise:

\epsilon(x) \sim \mathcal{N}(0, \sigma^2)

So given our standard Bayesian formulation for the posterior

p(\theta|\mathcal{D}) \propto p(y|x,\theta)\,p(\theta)

we assume a Gaussian observation model

p(y|x,\theta) = \mathcal{N}(y;f(x;\theta), \sigma^2)

and in turn a likelihood model

p(y|x;\theta) = \prod_{i=1}^N \mathcal{N}(y_i; f(x_i; \theta), \sigma^2)

Objective: maximize the likelihood of the data, \mathcal{D}, wrt the parameters, \theta.

Note: for the Gaussian noise model assumed above, this approach yields the same predictions as minimizing the MSE loss function we saw earlier.

\log p(y|x,\theta) \propto - \frac{1}{2\sigma^2}\sum_{i=1}^N \left[ y_i - f(x_i;\theta)\right]^2

We can simplify the notation a bit to make it more compact. This essentially stacks all of the observations together so that we can use vectorized representations, i.e. \mathcal{D} = \{ x_i, y_i\}_{i=1}^N.

\begin{aligned} \log p(\mathbf{y}|\mathbf{x},\theta) &= - \frac{1}{2\sigma^2} \left(\mathbf{y} - \boldsymbol{f}(\mathbf{x};\theta)\right)^\top\left(\mathbf{y} - \boldsymbol{f}(\mathbf{x};\theta) \right) \\ &= - \frac{1}{2\sigma^2} ||\mathbf{y} - \boldsymbol{f}(\mathbf{x};\theta)||_2^2 \end{aligned}

where ||\cdot||_2^2 is the squared Euclidean (\ell_2) norm; with a general covariance in place of \sigma^2 this becomes a Mahalanobis distance.

Note: we often see this notation in many papers and books.
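A small NumPy sketch of this vectorized log-likelihood makes the equivalence with the MSE loss explicit (the function name `gaussian_log_likelihood` is an illustrative choice): the constant term does not depend on \theta, so maximizing the log-likelihood and minimizing the squared error give the same parameters.

```python
import numpy as np

def gaussian_log_likelihood(y, f_x, sigma=1.0):
    """Log-likelihood of iid Gaussian observations y_i ~ N(f(x_i; theta), sigma^2)."""
    n = y.size
    quad = -0.5 / sigma**2 * np.sum((y - f_x) ** 2)    # -1/(2 sigma^2) ||y - f(x; theta)||_2^2
    const = -0.5 * n * np.log(2.0 * np.pi * sigma**2)  # does not depend on theta
    return quad + const
```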

Priors


Different Parameterizations

| Model | Equation |
| --- | --- |
| Identity | \mathbf{x} |
| Linear | \mathbf{wx}+\mathbf{b} |
| Basis | \mathbf{w}\boldsymbol{\phi}(\mathbf{x}) + \mathbf{b} |
| Non-Linear | \sigma\left( \mathbf{wx} + \mathbf{b}\right) |
| Neural Network | \boldsymbol{f}_{L}\circ \boldsymbol{f}_{L-1}\circ\ldots\circ\boldsymbol{f}_1 |
| Functional | \boldsymbol{f} \sim \mathcal{GP}\left(\boldsymbol{\mu}_{\boldsymbol \alpha}(\mathbf{x}),\boldsymbol{\sigma}^2_{\boldsymbol \alpha}(\mathbf{x})\right) |

Identity

f(x;\theta) = x
p(y|x,\theta) \sim \mathcal{N}(y|x, \sigma^2)
\mathcal{L}(\theta) = - \frac{1}{2\sigma^2}||y - x||_2^2

Linear

A linear function of \mathbf{w} wrt \mathbf{x}.

f(x;\theta) = w^\top x
p(y|x,\theta) \sim \mathcal{N}(y|w^\top x, \sigma^2)
\mathcal{L}(\theta) = - \frac{1}{2\sigma^2}||y - w^\top x||_2^2
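For the linear model, maximizing this likelihood has the familiar closed form of ordinary least squares. A NumPy sketch on synthetic data follows; the variable names, dimensions, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # N x D design matrix
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)     # y_i = w^T x_i + eps_i

# Maximizing the Gaussian likelihood in w is ordinary least squares:
# w_hat = (X^T X)^{-1} X^T y, solved here without forming the inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # should be close to w_true
```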

Basis Function

A linear function of \mathbf{w} wrt the basis function \phi(x).

f(x;\theta) = w^\top \phi(x;\theta)

Examples

  • \phi(x) = (1, x, x^2, \ldots) (polynomial)
  • \phi(x) = \tanh(x + \gamma)^\alpha (hyperbolic tangent)
  • \phi(x) = \exp(- \gamma||x-y||_2^2) (radial basis function)
  • \phi(x) = \left[\sin(2\pi\boldsymbol{\omega}\mathbf{x}),\cos(2\pi\boldsymbol{\omega}\mathbf{x}) \right]^\top (Fourier features)

Prob Formulation

p(y|x,\theta) \sim \mathcal{N}(y|w^\top \phi(x), \sigma^2)

Likelihood Loss

\mathcal{L}(\theta) = - \frac{1}{2 \sigma^2} ||y - w^\top \phi(x; \theta) ||_2^2
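Below is a NumPy sketch of two of the basis functions listed above and the corresponding fit. The helpers `poly_features` and `rbf_features`, the grid of centers, and the value of \gamma are illustrative assumptions. Because the model is still linear in w, the maximum-likelihood fit is the same least-squares problem as before, only with \phi(x) in place of x.

```python
import numpy as np

def poly_features(x, degree=3):
    """phi(x) = (1, x, x^2, ..., x^degree) for a 1-D input array x."""
    return np.stack([x ** d for d in range(degree + 1)], axis=-1)

def rbf_features(x, centers, gamma=10.0):
    """phi(x) = exp(-gamma * (x - c)^2) for a fixed grid of centers c."""
    return np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=x.shape)

# The model is linear in w, so the MLE is ordinary least squares on phi(x).
Phi = rbf_features(x, centers=np.linspace(0.0, 1.0, 10))
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_pred = Phi @ w_hat
```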

Non-Linear Function

A non-linear function in \mathbf{x} and \mathbf{w}.

f(x; \theta) = g\left(w^\top \phi (x; \theta_{\phi})\right)

Examples

  • Random Forests
  • Neural Networks
  • Gradient Boosting

Prob Formulation

p(y|x,\theta) \sim \mathcal{N}(y|g(w^\top \phi(x)), \sigma^2)

Likelihood Loss

\mathcal{L}(\theta) = - \frac{1}{2\sigma^2}||y - g(w^\top \phi(x))||_2^2
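A compact NumPy sketch of this composition is given below. The choice of g = tanh, the polynomial basis, and the function names are illustrative assumptions, not part of the formulation above.

```python
import numpy as np

def phi(x, degree=3):
    """Polynomial basis phi(x) = (1, x, ..., x^degree) for 1-D inputs."""
    return np.stack([x ** d for d in range(degree + 1)], axis=-1)

def f(x, w, g=np.tanh):
    """f(x; theta) = g(w^T phi(x)): non-linear in both x and w through g."""
    return g(phi(x) @ w)

def likelihood_loss(w, x, y, sigma=1.0):
    """L(theta) = -1/(2 sigma^2) ||y - g(w^T phi(x))||_2^2."""
    return -0.5 / sigma**2 * np.sum((y - f(x, w)) ** 2)
```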

Generic

A non-linear function in \mathbf{x} and \mathbf{w}.

y = f(x; \theta)

Examples

  • Random Forests
  • Neural Networks
  • Gradient Boosting

Prob Formulation

p(y|x,\theta) \sim \mathcal{N}(y|f(x; \theta), \sigma^2)

Likelihood Loss

\mathcal{L}(\theta) = - \frac{1}{2\sigma^2}||y - f(x; \theta)||_2^2
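For a generic non-linear f there is usually no closed-form maximizer, so the likelihood loss is minimized numerically. Here is a sketch using scipy.optimize.minimize with a hypothetical exponential-decay model; the model, the synthetic data, and the fixed \sigma are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 200)
y = 2.0 * np.exp(-1.5 * x) + 0.05 * rng.normal(size=x.shape)

def f(x, theta):
    """A generic non-linear model; here an exponential decay a * exp(-b * x)."""
    a, b = theta
    return a * np.exp(-b * x)

def neg_log_likelihood(theta, sigma=0.05):
    """Negative Gaussian log-likelihood, up to an additive constant."""
    return 0.5 / sigma**2 * np.sum((y - f(x, theta)) ** 2)

result = minimize(neg_log_likelihood, x0=np.array([1.0, 1.0]))
print(result.x)   # should be close to (2.0, 1.5)
```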

Generic (Heteroscedastic)

A non-linear function in \mathbf{x} and \mathbf{w}, where both the predictive mean and the variance depend on the input.

y = \boldsymbol{\mu}(x; \theta) + \boldsymbol{\sigma}(x;\theta)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1)

Examples

  • Random Forests
  • Neural Networks
  • Gradient Boosting

Prob Formulation

p(y|x,\theta) \sim \mathcal{N}(y|\boldsymbol{\mu}(x; \theta), \boldsymbol{\sigma}^2(x; \theta))

Likelihood Loss

-\log p(y|x,\theta) = \frac{1}{2}\log \boldsymbol{\sigma}^2(\mathbf{x};\boldsymbol{\theta}) + \frac{1}{2}||\mathbf{y} - \boldsymbol{\mu}(\mathbf{x};\boldsymbol{\theta})||^2_{\boldsymbol{\sigma}^2(\mathbf{x};\boldsymbol{\theta})} + \text{C}
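A short NumPy sketch of this heteroscedastic loss follows; the function name `heteroscedastic_nll` is an illustrative choice and the additive constant C is dropped. Note that the input-dependent variance both down-weights the squared error and is penalized through the log-variance term.

```python
import numpy as np

def heteroscedastic_nll(y, mu, sigma2):
    """Negative log-likelihood with an input-dependent variance sigma^2(x; theta).

    Larger predicted variances down-weight the squared error at those inputs,
    but are penalized through the log-variance term.
    """
    return np.sum(0.5 * np.log(sigma2) + 0.5 * (y - mu) ** 2 / sigma2)  # + constant
```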