Models#
- Model
- Measurement Model
- Likelihood Loss Function
- Loss Function
Error Minimization#
We choose an error measure, or loss function, \(\mathcal{L}\), to minimize with respect to the parameters, \(\theta\).
We typically add some form of regularization to constrain the solution.
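A minimal sketch of this idea, assuming a linear model with a squared-error loss and a ridge (\(L_2\)) penalty; the regularization strength `lam` is an assumed value:

```python
import numpy as np

# Ridge-regularized least squares: minimize ||y - Xw||^2 + lam * ||w||^2.
# The regularized normal equations give the closed-form minimizer.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

lam = 1e-2  # regularization strength (assumed value)
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```

The penalty shrinks the weights toward zero, trading a small amount of bias for a better-conditioned, lower-variance solution.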
Probabilistic Approach#
We explicitly account for noise in our model.
where \(\epsilon\) is the noise. The simplest noise assumption, common to many approaches, is iid Gaussian noise.
So given our standard Bayesian formulation for the posterior
we assume a Gaussian observation model
and in turn a likelihood model
Objective: maximize the likelihood of the data, \(\mathcal{D}\), with respect to the parameters, \(\theta\).
Note: for a Gaussian noise model (as we have assumed above), this approach yields the same predictions as minimizing the MSE loss function (seen above).
We can simplify the notation a bit to make it more compact. This essentially stacks all of the observations together so that we can use vectorized representations, i.e. \(\mathcal{D} = \{ x_i, y_i\}_{i=1}^N\)
where \(||\cdot ||_2^2\) is the squared \(L_2\) norm, i.e. the Mahalanobis distance for the special case of covariance \(\sigma^2\mathbf{I}\).
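The MLE/MSE equivalence noted above can be checked numerically: for iid Gaussian noise with fixed variance, the negative log-likelihood is an affine function of the MSE, so both objectives share the same minimizer in \(\theta\). A sketch with assumed toy values:

```python
import numpy as np

# For fixed sigma, NLL = const + N * MSE / (2 sigma^2),
# so minimizing the NLL in theta is equivalent to minimizing the MSE.
rng = np.random.default_rng(1)
y = rng.normal(size=10)
y_pred = y + 0.05 * rng.normal(size=10)  # stand-in model predictions
sigma = 0.3  # assumed fixed noise scale

mse = np.mean((y - y_pred) ** 2)
nll = 0.5 * len(y) * np.log(2 * np.pi * sigma**2) \
    + np.sum((y - y_pred) ** 2) / (2 * sigma**2)

const = 0.5 * len(y) * np.log(2 * np.pi * sigma**2)
# The two objectives differ only by a constant and a positive scale factor.
assert np.isclose(nll, const + len(y) * mse / (2 * sigma**2))
```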
Note: we often see this notation in many papers and books.
Priors#
Different Parameterizations#
| Model | Equation |
|---|---|
| Identity | \( \mathbf{x}\) |
| Linear | \(\mathbf{wx}+\mathbf{b}\) |
| Basis | \(\mathbf{w}\boldsymbol{\phi}(\mathbf{x}) + \mathbf{b}\) |
| Non-Linear | \(\sigma\left( \mathbf{wx} + \mathbf{b}\right)\) |
| Neural Network | \(\boldsymbol{f}_{L}\circ \boldsymbol{f}_{L-1}\circ\ldots\circ\boldsymbol{f}_1\) |
| Functional | \(\boldsymbol{f} \sim \mathcal{GP}\left(\boldsymbol{\mu}_{\boldsymbol \alpha}(\mathbf{x}),\boldsymbol{\sigma}^2_{\boldsymbol \alpha}(\mathbf{x})\right)\) |
Identity#
Linear#
A function that is linear in both the weights \(\mathbf{w}\) and the inputs \(\mathbf{x}\).
Basis Function#
A function that is linear in the weights \(\mathbf{w}\) but non-linear in \(\mathbf{x}\) through the basis function \(\boldsymbol{\phi}(\mathbf{x})\).
Examples
\(\phi(x) = (1, x, x^2, \ldots)\)
\(\phi(x) = \tanh(x + \gamma)^\alpha\)
\(\phi(x) = \exp(- \gamma||x-y||_2^2)\)
\(\phi(x) = \left[\sin(2\pi\boldsymbol{\omega}\mathbf{x}),\cos(2\pi\boldsymbol{\omega}\mathbf{x}) \right]^\top\)
Prob Formulation
Likelihood Loss
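Two of the example basis functions above, sketched in numpy; `gamma` and the RBF centers are assumed values chosen for illustration:

```python
import numpy as np

x = np.linspace(-1, 1, 5)

# Polynomial basis: phi(x) = (1, x, x^2)
poly = np.stack([np.ones_like(x), x, x**2], axis=1)

# Gaussian (RBF) basis: phi(x) = exp(-gamma * ||x - c||^2),
# evaluated against a set of centers c.
gamma = 2.0
centers = np.array([-0.5, 0.0, 0.5])
rbf = np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)
```

Either feature map turns the linear model \(\mathbf{w}\boldsymbol{\phi}(\mathbf{x})+\mathbf{b}\) into a non-linear function of \(\mathbf{x}\) while keeping it linear in \(\mathbf{w}\), so the error-minimization machinery above still applies unchanged.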
Non-Linear Function#
A non-linear function in \(\mathbf{x}\) and \(\mathbf{w}\).
Examples
Random Forests
Neural Networks
Gradient Boosting
Prob Formulation
Likelihood Loss
Generic#
A non-linear function in \(\mathbf{x}\) and \(\mathbf{w}\).
Examples
Random Forests
Neural Networks
Gradient Boosting
Prob Formulation
Likelihood Loss
Generic (Heteroscedastic)#
A non-linear function in \(\mathbf{x}\) and \(\mathbf{w}\), where the noise variance also depends on the input \(\mathbf{x}\) rather than being constant.
Examples
Random Forests
Neural Networks
Gradient Boosting
Prob Formulation
Likelihood Loss
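A hypothetical sketch of the heteroscedastic likelihood loss: the model predicts both a mean \(\mu(\mathbf{x})\) and an input-dependent variance \(\sigma^2(\mathbf{x})\), and the per-point negative log-likelihood weights each residual by its own variance (the numeric values below are illustrative):

```python
import numpy as np

def hetero_nll(y, mu, sigma2):
    """Mean Gaussian NLL with a per-point predicted variance sigma2."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma2)
                   + (y - mu) ** 2 / (2 * sigma2))

y = np.array([0.0, 1.0, 2.0])
mu = np.array([0.1, 0.9, 2.2])        # predicted means (assumed values)
sigma2 = np.array([0.05, 0.1, 0.5])   # larger where the model is less certain
loss = hetero_nll(y, mu, sigma2)
```

Compared with the homoscedastic case, points the model flags as noisy contribute less to the squared-error term, at the cost of the \(\log\sigma^2\) penalty that stops the variance from growing without bound.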