# Models

- **Model**: $\mathbf{y} = f(\mathbf{x}; \boldsymbol{\theta}) + \boldsymbol{\epsilon}$
- **Measurement Model**: $p(\mathbf{y}|\mathbf{x}; \boldsymbol{\theta}) = \mathcal{N}(\mathbf{y}|\boldsymbol{f}(\mathbf{x};\boldsymbol{\theta}), \sigma^2)$
- **Likelihood Loss Function**: $\log p(\mathbf{y}|\mathbf{x}; \boldsymbol{\theta})$
- **Loss Function**: $\mathcal{L}(\boldsymbol{\theta}) = - \frac{1}{2\sigma^2}||\mathbf{y} - f(\mathbf{x};\boldsymbol{\theta})||_2^2 + \log p(\boldsymbol{\theta})$

## Error Minimization

We choose an error measure, or loss function, $\mathcal{L}$, to minimize with respect to the parameters, $\boldsymbol{\theta}$.
$$\mathcal{L}(\theta) = \sum_{i=1}^N \left[ f(x_i;\theta) - y_i \right]^2$$

We typically add some form of regularization to constrain the solution:
$$\mathcal{L}(\theta) = \sum_{i=1}^N \left[ f(x_i;\theta) - y_i \right]^2 + \lambda \mathcal{R}(\theta)$$

## Probabilistic Approach

We explicitly account for noise in our model:
$$y = f(x;\theta) + \epsilon(x)$$

where $\epsilon$ is the noise. The simplest noise assumption, seen in many approaches, is iid Gaussian noise:
$$\epsilon(x) \sim \mathcal{N}(0, \sigma^2)$$

So, given our standard Bayesian formulation for the posterior
$$p(\theta|\mathcal{D}) \propto p(y|x,\theta)p(\theta)$$

we assume a Gaussian observation model
$$p(y|x,\theta) = \mathcal{N}(y;f(x;\theta), \sigma^2)$$

and in turn a likelihood model
$$p(y|x;\theta) = \prod_{i=1}^N \mathcal{N}(y_i; f(x_i; \theta), \sigma^2)$$

**Objective**: maximize the likelihood of the data, $\mathcal{D}$, with respect to the parameters, $\theta$.
**Note**: for the Gaussian noise model assumed above, maximizing the likelihood yields the same predictions as minimizing the MSE loss from the previous section.
$$\log p(y|x,\theta) \propto - \frac{1}{2\sigma^2}\sum_{i=1}^N \left[ y_i - f(x_i;\theta)\right]^2$$

We can simplify the notation a bit to make it more compact. Collecting all of the observations, $\mathcal{D} = \{ x_i, y_i\}_{i=1}^N$, lets us use vectorized representations:
$$\begin{aligned}
\log p(\mathbf{y}|\mathbf{x},\theta)
&= - \frac{1}{2\sigma^2} \left(\mathbf{y} - \boldsymbol{f}(\mathbf{x};\theta)\right)^\top\left(\mathbf{y} - \boldsymbol{f}(\mathbf{x};\theta) \right) \\
&= - \frac{1}{2\sigma^2} ||\mathbf{y} - \boldsymbol{f}(\mathbf{x};\theta)||_2^2
\end{aligned}$$

where $||\cdot||_2^2$ is the squared Euclidean ($L_2$) norm.
**Note**: we often see this notation in many papers and books.
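As a numerical check of the equivalence noted above, here is a minimal NumPy sketch (the linear model and toy data are illustrative assumptions): the maximum-likelihood fit of a linear-Gaussian model reduces to least squares, the negative log-likelihood equals a scaled MSE plus a constant, and the vectorized inner product matches the per-point sum.

```python
import numpy as np

def gaussian_nll(y, f, sigma2):
    """Negative log-likelihood of iid Gaussian observations y_i ~ N(f_i, sigma2)."""
    n = y.size
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((y - f) ** 2) / (2 * sigma2)

# Hypothetical toy data from a linear model with Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
sigma = 0.1
y = X @ w_true + sigma * rng.normal(size=100)

# Maximum likelihood for a linear-Gaussian model is ordinary least squares.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
f = X @ w_hat

# NLL = constant + (n / (2 sigma^2)) * MSE, so both losses rank models identically.
n, sigma2 = y.size, sigma ** 2
const = 0.5 * n * np.log(2 * np.pi * sigma2)
mse = np.mean((y - f) ** 2)

# The vectorized log-likelihood matches the per-point sum.
loop_ll = -sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / (2 * sigma2)
vec_ll = -np.dot(y - f, y - f) / (2 * sigma2)
```

Because the additive constant and the positive scale $1/(2\sigma^2)$ do not depend on $\theta$, minimizing the NLL and minimizing the MSE recover the same parameters.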
## Priors

## Different Parameterizations

| Model | Equation |
|---|---|
| Identity | $\mathbf{x}$ |
| Linear | $\mathbf{wx}+\mathbf{b}$ |
| Basis | $\mathbf{w}\boldsymbol{\phi}(\mathbf{x}) + \mathbf{b}$ |
| Non-Linear | $\sigma\left( \mathbf{wx} + \mathbf{b}\right)$ |
| Neural Network | $\boldsymbol{f}_{L}\circ \boldsymbol{f}_{L-1}\circ\ldots\circ\boldsymbol{f}_1$ |
| Functional | $\boldsymbol{f} \sim \mathcal{GP}\left(\boldsymbol{\mu}_{\boldsymbol{\alpha}}(\mathbf{x}),\boldsymbol{\sigma}^2_{\boldsymbol{\alpha}}(\mathbf{x})\right)$ |
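To make the table concrete, here is a hedged sketch of each finite-dimensional parameterization as a plain Python callable (the weights, basis, and nonlinearities are illustrative choices, not prescribed by the table; the functional GP row requires a distribution over functions and is omitted):

```python
import numpy as np

# Illustrative shared parameters.
w, b = np.array([0.5, -1.0]), 0.1

models = {
    "identity":   lambda x: x,
    "linear":     lambda x: w @ x + b,
    "basis":      lambda x: w @ np.array([x[0], x[0] ** 2]) + b,  # phi(x) = (x, x^2)
    "non-linear": lambda x: np.tanh(w @ x + b),                   # sigma = tanh
    "neural net": lambda x: np.tanh(w @ np.tanh(x) + b),          # f2 ∘ f1
}

x = np.array([1.0, 2.0])
outputs = {name: f(x) for name, f in models.items()}
```

Each row only changes the functional form of $f$; the Gaussian measurement model and the resulting loss are unchanged.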
### Identity

$$f(x;\theta) = x$$

$$p(y|x,\theta) = \mathcal{N}(y|x, \sigma^2)$$

$$\mathcal{L}(\theta) = - \frac{1}{2\sigma^2}||y - x||_2^2$$

### Linear

A linear function of $\mathbf{w}$ with respect to $\mathbf{x}$:
$$f(x;\theta) = w^\top x$$

$$p(y|x,\theta) = \mathcal{N}(y|w^\top x, \sigma^2)$$

$$\mathcal{L}(\theta) = - \frac{1}{2\sigma^2}||y - w^\top x||_2^2$$

### Basis Function

A linear function of $\mathbf{w}$ with respect to the basis function $\phi(x)$:
$$f(x;\theta) = w^\top \phi(x;\theta)$$

Examples:
- $\phi(x) = (1, x, x^2, \ldots)$
- $\phi(x) = \tanh(x + \gamma)^\alpha$
- $\phi(x) = \exp(- \gamma||x-y||_2^2)$
- $\phi(x) = \left[\sin(2\pi\boldsymbol{\omega}\mathbf{x}),\cos(2\pi\boldsymbol{\omega}\mathbf{x}) \right]^\top$
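The basis functions listed above can be sketched for 1-D inputs as follows (the centers, frequencies, and hyperparameters are illustrative assumptions):

```python
import numpy as np

def poly_basis(x, degree=3):
    """phi(x) = (1, x, x^2, ...): polynomial features."""
    return np.stack([x ** d for d in range(degree + 1)], axis=-1)

def rbf_basis(x, centers, gamma=1.0):
    """phi(x) = exp(-gamma ||x - c||^2) for a set of centers c."""
    return np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)

def fourier_basis(x, omegas):
    """phi(x) = [sin(2 pi w x), cos(2 pi w x)] for a set of frequencies w."""
    arg = 2 * np.pi * x[:, None] * omegas[None, :]
    return np.concatenate([np.sin(arg), np.cos(arg)], axis=-1)

x = np.linspace(-1.0, 1.0, 5)
P = poly_basis(x)                            # shape (5, 4)
R = rbf_basis(x, np.array([-0.5, 0.5]))      # shape (5, 2)
F = fourier_basis(x, np.array([1.0, 2.0]))   # shape (5, 4)
```

In every case the model stays linear in $w$, so the fitted weights still have a closed-form least-squares solution.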
**Probabilistic Formulation**
$$p(y|x,\theta) = \mathcal{N}(y|w^\top \phi(x), \sigma^2)$$

**Likelihood Loss**
$$\mathcal{L}(\theta) = - \frac{1}{2 \sigma^2} ||y - w^\top \phi(x; \theta) ||_2^2$$

### Non-Linear Function

A non-linear function in $\mathbf{x}$ and $\mathbf{w}$:
$$f(x; \theta) = g(w^\top \phi(x; \theta_{\phi}))$$

Examples:
- Random Forests
- Neural Networks
- Gradient Boosting
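A minimal sketch of this composition, with the illustrative choices $g = \tanh$ and a polynomial $\phi$ (assumptions for the example, not prescriptions from the text):

```python
import numpy as np

def phi(x):
    # illustrative polynomial basis: (1, x, x^2)
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

def f(x, w):
    # non-linear in both x and w: g(w^T phi(x)) with g = tanh
    return np.tanh(phi(x) @ w)

x = np.linspace(-1.0, 1.0, 7)
w = np.array([0.2, 1.0, -0.5])
out = f(x, w)
```

Unlike the basis-function case, the outer nonlinearity $g$ removes the closed-form solution, so the loss below is typically minimized iteratively.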
**Probabilistic Formulation**
$$p(y|x,\theta) = \mathcal{N}(y|g(w^\top \phi(x)), \sigma^2)$$

**Likelihood Loss**
$$\mathcal{L}(\theta) = - \frac{1}{2\sigma^2}||y - g(w^\top \phi(x))||_2^2$$

### Generic

A non-linear function in $\mathbf{x}$ and $\mathbf{w}$:
$$y = f(x; \theta)$$

Examples:
- Random Forests
- Neural Networks
- Gradient Boosting
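When $f$ has no special structure, we can still minimize the squared-error loss numerically. Below is a hedged sketch using finite-difference gradient descent on a hypothetical model $f(x;\theta) = \theta_0 \tanh(\theta_1 x)$ (both the model and the optimizer settings are illustrative):

```python
import numpy as np

def f(x, theta):
    # hypothetical generic model: amplitude * tanh(scale * x)
    return theta[0] * np.tanh(theta[1] * x)

def loss(theta, x, y):
    return np.mean((y - f(x, theta)) ** 2)

def grad(theta, x, y, eps=1e-6):
    # central finite differences: needs only loss evaluations, not f's structure
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (loss(theta + d, x, y) - loss(theta - d, x, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 40)
theta_true = np.array([1.5, 0.8])
y = f(x, theta_true) + 0.05 * rng.normal(size=x.size)

theta = np.array([1.0, 1.0])
for _ in range(2000):
    theta -= 0.5 * grad(theta, x, y)
```

In practice one would use analytic or automatic differentiation rather than finite differences; the point is only that the Gaussian-likelihood loss applies to any $f$.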
**Probabilistic Formulation**
$$p(y|x,\theta) = \mathcal{N}(y|f(x; \theta), \sigma^2)$$

**Likelihood Loss**
$$\mathcal{L}(\theta) = - \frac{1}{2\sigma^2}||y - f(x; \theta)||_2^2$$

### Generic (Heteroscedastic)

A non-linear function in $\mathbf{x}$ and $\mathbf{w}$, with input-dependent noise:
$$y = \boldsymbol{\mu}(x; \theta) + \boldsymbol{\sigma}(x;\theta)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1)$$

Examples:
- Random Forests
- Neural Networks
- Gradient Boosting
**Probabilistic Formulation**
$$p(y|x,\theta) = \mathcal{N}(y|\boldsymbol{\mu}(x; \theta), \boldsymbol{\sigma}^2(x; \theta))$$

**Likelihood Loss**
$$-\log p(y|x,\theta) = \frac{1}{2}\log \boldsymbol{\sigma}^2(\mathbf{x};\boldsymbol{\theta}) + \frac{1}{2}||\mathbf{y} - \boldsymbol{\mu}(\mathbf{x};\boldsymbol{\theta})||^2_{\boldsymbol{\sigma}^2(\mathbf{x};\boldsymbol{\theta})} + \text{C}$$
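This heteroscedastic NLL can be checked numerically. In the hedged sketch below, the data-generating process is an illustrative assumption; the comparison shows that a variance model matching the true input-dependent noise attains a lower NLL than a constant-variance model with the same mean.

```python
import numpy as np

def hetero_nll(y, mu, sigma2):
    """Per-point Gaussian NLL with input-dependent variance (constant C dropped)."""
    return np.mean(0.5 * np.log(sigma2) + 0.5 * (y - mu) ** 2 / sigma2)

# Illustrative data: the noise level grows with the input.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
mu_true = np.sin(2 * np.pi * x)
sigma_true = 0.1 + 0.5 * x
y = mu_true + sigma_true * rng.normal(size=x.size)

# Same mean, two variance models: the true heteroscedastic one vs. a constant.
nll_hetero = hetero_nll(y, mu_true, sigma_true ** 2)
nll_homo = hetero_nll(y, mu_true, np.full_like(x, np.mean(sigma_true ** 2)))
```

The $\frac{1}{2}\log\boldsymbol{\sigma}^2(\mathbf{x};\boldsymbol{\theta})$ term is what prevents the model from trivially inflating the variance: it penalizes large $\sigma^2$ while the weighted residual term penalizes small $\sigma^2$.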