# Models

**Model**

$$y = f(x;\theta) + \epsilon$$

**Measurement Model**

$$p(y \mid x;\theta) = \mathcal{N}(y \mid f(x;\theta), \sigma^2)$$

**Likelihood Loss Function**

$$\log p(y \mid x;\theta)$$

**Loss Function**

$$\mathcal{L}(\theta) = -\frac{1}{2\sigma^2}\|y - f(x;\theta)\|_2^2 + \log p(\theta)$$
## Error Minimization

We choose an error measure or loss function, $\mathcal{L}$, to minimize wrt the parameters, $\theta$.
$$\mathcal{L}(\theta) = \sum_{i=1}^N \left[f(x_i;\theta) - y_i\right]^2$$

We typically add some sort of regularization in order to constrain the solution:

$$\mathcal{L}(\theta) = \sum_{i=1}^N \left[f(x_i;\theta) - y_i\right]^2 + \lambda \mathcal{R}(\theta)$$
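A minimal sketch of this objective in NumPy, assuming a model callable `f(X, theta)` and taking $\mathcal{R}(\theta) = \|\theta\|_2^2$ (a ridge penalty) as the regularizer; the names here are illustrative:

```python
import numpy as np

def regularized_loss(theta, f, X, y, lam=0.1):
    """Sum-of-squares error plus a ridge penalty R(theta) = ||theta||^2."""
    residuals = f(X, theta) - y          # f(x_i; theta) - y_i for all i
    return np.sum(residuals ** 2) + lam * np.sum(theta ** 2)
```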
## Probabilistic Approach

We explicitly account for noise in our model:
$$y = f(x;\theta) + \epsilon(x)$$

where $\epsilon$ is the noise. The simplest noise assumption we see in many approaches is iid Gaussian noise:

$$\epsilon(x) \sim \mathcal{N}(0, \sigma^2)$$

So given our standard Bayesian formulation for the posterior

$$p(\theta \mid \mathcal{D}) \propto p(y \mid x, \theta)\, p(\theta)$$

we assume a Gaussian observation model

$$p(y \mid x, \theta) = \mathcal{N}(y;\, f(x;\theta), \sigma^2)$$

and in turn a likelihood model

$$p(y \mid x; \theta) = \prod_{i=1}^N \mathcal{N}(y_i;\, f(x_i;\theta), \sigma^2)$$

Objective: maximize the likelihood of the data, $\mathcal{D}$, wrt the parameters, $\theta$.

Note: for the Gaussian noise model we have assumed above, this approach yields the same predictions as minimizing the MSE loss function we saw above.

$$\log p(y \mid x, \theta) \propto -\frac{1}{2\sigma^2}\sum_{i=1}^N \left[y_i - f(x_i;\theta)\right]^2$$

We can simplify the notation a bit to make it more compact. This essentially puts all of the observations together so that we can use vectorized representations, i.e. $\mathcal{D} = \{x_i, y_i\}_{i=1}^N$:

$$\log p(y \mid x, \theta) = -\frac{1}{2\sigma^2}\left(y - f(x;\theta)\right)^\top\left(y - f(x;\theta)\right) = -\frac{1}{2\sigma^2}\|y - f(x;\theta)\|_2^2$$

where $\|\cdot\|_2^2$ is the squared Euclidean norm (a Mahalanobis distance with covariance $\sigma^2\mathbf{I}$).

Note: we often see this notation in many papers and books.
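As a sketch of the vectorized form (again assuming a hypothetical model callable `f(X, theta)` and a fixed noise variance), note that the only $\theta$-dependent term is the negative sum of squared residuals, which is exactly why maximizing this likelihood matches minimizing the MSE:

```python
import numpy as np

def gaussian_log_likelihood(theta, f, X, y, sigma2=1.0):
    """log p(y | x, theta) under iid Gaussian noise with variance sigma2."""
    r = y - f(X, theta)                              # residual vector
    n = y.shape[0]
    return -0.5 / sigma2 * (r @ r) - 0.5 * n * np.log(2 * np.pi * sigma2)
```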
## Priors
## Different Parameterizations

| Model | Equation |
|---|---|
| Identity | $x$ |
| Linear | $wx + b$ |
| Basis | $w\phi(x) + b$ |
| Non-Linear | $\sigma(wx + b)$ |
| Neural Network | $f_L \circ f_{L-1} \circ \ldots \circ f_1$ |
| Functional | $f \sim \mathcal{GP}\left(\mu_\alpha(x), \sigma_\alpha^2(x)\right)$ |
### Identity

$$f(x;\theta) = x$$

$$p(y \mid x, \theta) = \mathcal{N}(y \mid x, \sigma^2)$$

$$\mathcal{L}(\theta) = -\frac{1}{2\sigma^2}\|y - x\|_2^2$$
### Linear

A linear function of $w$ wrt $x$.

$$f(x;\theta) = w^\top x$$

$$p(y \mid x, \theta) = \mathcal{N}(y \mid w^\top x, \sigma^2)$$

$$\mathcal{L}(\theta) = -\frac{1}{2\sigma^2}\|y - w^\top x\|_2^2$$
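For this linear parameterization the likelihood loss has a closed-form maximizer (the normal equations); a sketch, assuming the rows of `X` are the inputs $x_i$:

```python
import numpy as np

def fit_linear_mle(X, y):
    """Maximum-likelihood w for y = w^T x + eps, eps ~ N(0, sigma^2)."""
    # Solves the normal equations X^T X w = X^T y in a numerically stable way.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w
```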
### Basis Function

A linear function of $w$ wrt the basis function $\phi(x)$.

$$f(x;\theta) = w^\top \phi(x;\theta)$$

Examples

- $\phi(x) = (1, x, x^2, \ldots)$
- $\phi(x) = \tanh(x + \gamma)^\alpha$
- $\phi(x) = \exp\left(-\gamma\|x - y\|_2^2\right)$
- $\phi(x) = \left[\sin(2\pi\omega x), \cos(2\pi\omega x)\right]^\top$

Prob Formulation

$$p(y \mid x, \theta) = \mathcal{N}(y \mid w^\top\phi(x), \sigma^2)$$

Likelihood Loss

$$\mathcal{L}(\theta) = -\frac{1}{2\sigma^2}\|y - w^\top\phi(x;\theta)\|_2^2$$
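Since the model is still linear in $w$, the same least-squares machinery applies after mapping the inputs through $\phi$; a sketch using the polynomial basis from the examples above (the data here is synthetic, purely for illustration):

```python
import numpy as np

def poly_features(x, degree=3):
    """phi(x) = (1, x, x^2, ..., x^degree) for a 1D input array."""
    return np.stack([x ** d for d in range(degree + 1)], axis=-1)

# Synthetic 1D regression problem.
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(np.pi * x) + 0.1 * np.random.randn(50)

# Fit w by least squares on the transformed inputs, then predict.
Phi = poly_features(x)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_pred = Phi @ w
```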
### Non-Linear Function

A non-linear function in $x$ and $w$.

$$f(x;\theta) = g\left(w^\top \phi(x;\theta_\phi)\right)$$

Examples

- Random Forests
- Neural Networks
- Gradient Boosting

Prob Formulation

$$p(y \mid x, \theta) = \mathcal{N}\left(y \mid g(w^\top\phi(x)), \sigma^2\right)$$

Likelihood Loss

$$\mathcal{L}(\theta) = -\frac{1}{2\sigma^2}\|y - g(w^\top\phi(x))\|_2^2$$
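With a non-linear $g$ there is no closed form, so the loss is typically optimized iteratively; a minimal gradient-descent sketch, taking $g = \tanh$ and $\phi$ as the identity purely for illustration:

```python
import numpy as np

def nll_grad_step(w, X, y, lr=0.1, sigma2=1.0):
    """One gradient step on -log p(y|x, theta) for f(x) = tanh(w^T x)."""
    p = np.tanh(X @ w)                               # g(w^T phi(x)), phi = identity
    r = y - p                                        # residuals
    grad = -(X.T @ (r * (1.0 - p ** 2))) / sigma2    # chain rule through tanh
    return w - lr * grad
```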
### Generic

A non-linear function in $x$ and $w$.

$$y = f(x;\theta)$$

Examples

- Random Forests
- Neural Networks
- Gradient Boosting

Prob Formulation

$$p(y \mid x, \theta) = \mathcal{N}\left(y \mid f(x;\theta), \sigma^2\right)$$

Likelihood Loss

$$\mathcal{L}(\theta) = -\frac{1}{2\sigma^2}\|y - f(x;\theta)\|_2^2$$
### Generic (Heteroscedastic)

A non-linear function in $x$ and $w$, where both the mean and the noise variance depend on the input.

$$y = \mu(x;\theta) + \sigma(x;\theta)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1)$$

Examples

- Random Forests
- Neural Networks
- Gradient Boosting

Prob Formulation

$$p(y \mid x, \theta) = \mathcal{N}\left(y \mid \mu(x;\theta), \sigma^2(x;\theta)\right)$$

Likelihood Loss

$$-\log p(y \mid x, \theta) = \frac{1}{2}\log \sigma^2(x;\theta) + \frac{1}{2}\frac{\|y - \mu(x;\theta)\|_2^2}{\sigma^2(x;\theta)} + C$$
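A sketch of this heteroscedastic negative log-likelihood, assuming hypothetical callables `mu(X, theta)` and `log_var(X, theta)`; predicting the log-variance is a common trick to keep $\sigma^2(x;\theta)$ positive:

```python
import numpy as np

def heteroscedastic_nll(theta, mu, log_var, X, y):
    """-log p(y|x, theta) for y ~ N(mu(x), sigma^2(x)), dropping the constant C."""
    m = mu(X, theta)
    lv = log_var(X, theta)                         # log sigma^2(x; theta)
    return np.sum(0.5 * lv + 0.5 * (y - m) ** 2 / np.exp(lv))
```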