# Bayesian Regression

## Model
In typical regression problems we have some data \(\mathcal{D}\) consisting of input-output pairs \(X, y\). We wish to find a function \(f(\cdot)\) that maps the inputs \(X\) to the outputs \(y\). We also assume that there is some noise \(\epsilon_y\) in the outputs. (We can also have noise on the inputs \(X\), but we will discuss that at a later time.) So concretely, we have:

$$
\begin{aligned}
y &= w \: x + \epsilon_y \\
\epsilon_y &\sim \mathcal{N}(0, \sigma_y^2)
\end{aligned}
$$

Let's demonstrate this by generating \(N\) data points from the true distribution.
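Here is a minimal sketch of how such a dataset could be generated and plotted; the slope `w_true`, noise scale `sigma_y`, and sample size `N` are illustrative choices, not values taken from the original experiment.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative ground-truth parameters (assumed for this sketch)
w_true = 2.0    # true slope
sigma_y = 0.5   # standard deviation of the output noise
N = 20          # number of data points

rng = np.random.default_rng(123)
x = rng.uniform(-1.0, 1.0, size=N)         # inputs X
eps_y = rng.normal(0.0, sigma_y, size=N)   # output noise ~ N(0, sigma_y^2)
y = w_true * x + eps_y                     # noisy observations

# Plot the noisy samples against the true (noise-free) line
x_grid = np.linspace(-1.0, 1.0, 100)
plt.scatter(x, y, label="noisy samples")
plt.plot(x_grid, w_true * x_grid, color="black", label="true line")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```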
As seen in the figure above, the points we generated lie somewhere along the true line. Of course, we are privy to the true line, but an algorithm might have trouble with so few points. In addition, the weight space is quite large as well. One thing we can do is maximize the likelihood that \(y\) comes from some normal distribution \(\mathcal{N}\) with some mean \(\mu\) and variance \(\sigma^2\):

$$
\mathcal{F} = \underset{w}{\text{max}} \sum_{i=1}^{N} \log \mathcal{N}(y_i | w \: x_i, \sigma^2)
$$

So we will use the mean squared error (MSE) as the loss function for our problem, since maximizing the likelihood is equivalent to minimizing the MSE.
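As a quick sketch of this equivalence, we can fit the single weight by minimizing the MSE directly; with no intercept the least-squares solution has a closed form. This reuses the `x` and `y` arrays from the sketch above.

```python
def mse(w, x, y):
    """Mean squared error of the linear model y ≈ w * x."""
    return np.mean((y - w * x) ** 2)

# Closed-form least-squares / maximum-likelihood estimate of the slope
# (single weight, no intercept): w* = sum(x_i * y_i) / sum(x_i^2)
w_mle = np.sum(x * y) / np.sum(x ** 2)
print(f"w_mle = {w_mle:.3f}, MSE = {mse(w_mle, x, y):.3f}")
```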
Proof: max MLE = min MSE
The likelihood of our model is:
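$$
p(\mathbf{y} | \mathbf{x}, \mathbf{w}) = \prod_{i=1}^{N} p(y_i | x_i, \mathbf{w})
$$

where we assume the \(N\) observations are independent and identically distributed.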
And for simplicity, we assume the noise \(\epsilon_y\) comes from a Gaussian distribution and that its variance \(\sigma^2\) is constant. So we can rewrite our likelihood as
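$$
p(\mathbf{y} | \mathbf{x}, \mathbf{w}) = \prod_{i=1}^{N} \mathcal{N}(y_i | \mathbf{w} \: x_i, \sigma^2)
$$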
Plugging in the full formula for the Gaussian distribution with some simplifications gives us:
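$$
p(\mathbf{y} | \mathbf{x}, \mathbf{w}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mathbf{w} \: x_i)^2}{2\sigma^2} \right)
$$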
Since the logarithm is monotonic, maximizing the log-likelihood is the same as maximizing the likelihood. We can use the log rule \(\log ab = \log a + \log b\) to rewrite this expression and separate the constant term from the exponential. Also, \(\log e^x = x\).
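$$
\log p(\mathbf{y} | \mathbf{x}, \mathbf{w}) = \sum_{i=1}^{N} \left[ \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(y_i - \mathbf{w} \: x_i)^2}{2\sigma^2} \right] = N \log \frac{1}{\sqrt{2\pi\sigma^2}} - \sum_{i=1}^{N} \frac{(y_i - \mathbf{w} \: x_i)^2}{2\sigma^2}
$$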
The first term is constant, so we can ignore it in our loss function. We can do the same for the \(2\sigma^2\) denominator of the second term, since we assumed the noise variance is fixed. Let's simplify it to make our life easier:
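$$
\mathcal{F}(\mathbf{w}) = -\sum_{i=1}^{N} (y_i - \mathbf{w} \: x_i)^2
$$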
So we want to maximize this quantity; in other words, we want to find the parameter \(\mathbf{w}\) such that this expression is at its maximum:
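$$
\mathbf{w}^* = \underset{\mathbf{w}}{\text{arg max}} \; -\sum_{i=1}^{N} (y_i - \mathbf{w} \: x_i)^2
$$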
We can rewrite this expression because maximizing a negative quantity is the same as minimizing the corresponding positive quantity:
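$$
\mathbf{w}^* = \underset{\mathbf{w}}{\text{arg min}} \; \sum_{i=1}^{N} (y_i - \mathbf{w} \: x_i)^2
$$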
This is the same as the MSE expression, up to the scalar factor \(1/N\), which does not change the minimizer:
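$$
\text{MSE}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \mathbf{w} \: x_i)^2
$$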
Note: if we did not know \(\sigma_y^2\), then we would have to optimize over it as well.