An overview of the methods used to find the best parameters.
Inference

In general, there are three ways to obtain the parameters: point estimates, approximate inference, and sampling.
Point Estimates

We assume that the posterior distribution is proportional to the product of the likelihood and the prior, ignoring the normalization constant:

$$
p(\boldsymbol{\theta}|\mathbf{y}) \propto p(\mathbf{y}|\boldsymbol{\theta})p(\boldsymbol{\theta})
$$

Thus, we obtain an approximate estimate of the parameters given the measurements. To find the parameters, we minimize the negative log-posterior accumulated over the $N$ measurements:

$$
\boldsymbol{L}(\boldsymbol{\theta}) = -\sum_{n=1}^N \log p(\boldsymbol{\theta}|\mathbf{y}_n)
$$
Whichever methodology we use, we still need to find the parameters that minimize this loss:
$$
\boldsymbol{\theta}^* = \underset{\boldsymbol{\theta}}{\text{argmin}} \hspace{2mm} \boldsymbol{L}(\boldsymbol{\theta})
$$

This requires one to iterate until convergence:
$$
\begin{aligned}
\text{Initial Parameters}: && &&
\boldsymbol{\theta}_0 &= \ldots \\
\text{Initial Optimization State}: && &&
\mathbf{h}_0 &= \ldots \\
\text{Optimization Step}: && &&
\boldsymbol{\theta}^{(k)}, \mathbf{h}^{(k)} &= \boldsymbol{g}(\boldsymbol{\theta}^{(k-1)}, \mathbf{h}^{(k-1)}, \boldsymbol{\alpha})
\end{aligned}
$$

For these methods, we use an off-the-shelf optimizer for solving unconstrained problems.
In particular, we use the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm (Fletcher, 2000). It is a quasi-Newton scheme that maintains an approximation to the Hessian matrix of the loss function, in this case the negative log-likelihood loss. It is chosen because it converges quickly on smooth nonlinear problems where gradient information is available.
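As a concrete illustration of this loop, the sketch below wraps a generic negative log-likelihood and hands it to SciPy's BFGS implementation. The Gaussian likelihood, the toy data, and the variable names are illustrative assumptions, not part of the original text.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Toy measurements, purely for illustration.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=500)

def loss(theta, y):
    """Negative log-likelihood L(theta). The scale is optimized on the
    log scale so the problem stays unconstrained for BFGS."""
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

theta0 = np.array([0.0, 0.0])                               # initial parameters theta_0
result = minimize(loss, theta0, args=(y,), method="BFGS")   # iterates g(.) until convergence

mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat, result.nit)                        # estimates and iteration count
```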
MLE Estimation

$$
\boldsymbol{L}_\text{MLE}(\boldsymbol{\theta}) = -\sum_{n=1}^N \log p(y_n|\boldsymbol{\theta})
$$

We place some constraints on the parameters: the mean and shape parameters are completely free, while the scale parameter is constrained to be positive.

$$
\begin{aligned}
\text{Mean}: && &&
\mu &\in \mathbb{R} \\
\text{Scale}: && &&
\sigma &\in \mathbb{R}^+ \\
\text{Shape}: && &&
\kappa &\in \mathbb{R}
\end{aligned}
$$
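A minimal MLE sketch under these constraints, assuming (purely for illustration) a generalized extreme value likelihood with location $\mu$, scale $\sigma$, and shape $\kappa$; the positivity constraint on $\sigma$ is handled by optimizing $\log\sigma$. The data, the choice of likelihood, and the function names are assumptions rather than the original model.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import genextreme

def nll(theta, y):
    """Negative log-likelihood L_MLE. The scale is parameterized as
    log(sigma) so it is guaranteed positive; scipy's genextreme shape
    is taken here as -kappa (GEV sign conventions vary)."""
    mu, log_sigma, kappa = theta
    return -np.sum(genextreme.logpdf(y, -kappa, loc=mu, scale=np.exp(log_sigma)))

# Placeholder data drawn from a GEV, for illustration only.
y = genextreme.rvs(-0.1, loc=10.0, scale=2.0, size=200, random_state=1)

theta0 = np.array([np.mean(y), np.log(np.std(y)), 0.1])
mle = minimize(nll, theta0, args=(y,), method="BFGS")
mu_mle, sigma_mle, kappa_mle = mle.x[0], np.exp(mle.x[1]), mle.x[2]
```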
MAP Estimation

The MAP estimation is very similar to the MLE estimation except that we put priors on the parameters.
$$
\boldsymbol{L}_\text{MAP}(\boldsymbol{\theta}) = -\sum_{n=1}^N \log p(y_n|\boldsymbol{\theta}) - \log p(\boldsymbol{\theta})
$$

We place prior distributions on the parameters. The mean and shape parameters remain unconstrained, while the positivity of the scale parameter is encoded through a log-normal prior.

$$
\begin{aligned}
\text{Mean}: && &&
\mu &\sim \text{Normal}(\hat{\mu},\hat{\sigma}) \\
\text{Scale}: && &&
\sigma &\sim \text{LogNormal}(0.5\hat{\sigma}, 0.25) \\
\text{Shape}: && &&
\kappa &\sim \text{Normal}(\hat{\kappa}, 0.1)
\end{aligned}
$$
The hyperparameters for the $\mu$ prior, $\hat{\mu}$ and $\hat{\sigma}$, are estimated directly from the data by calculating the empirical mean and standard deviation. The same estimate $\hat{\sigma}$ is reused in the prior on the scale parameter.
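Sketching the MAP objective under the same illustrative GEV likelihood: the negative log-likelihood from the MLE example is augmented with the log-priors above. Reading $\text{LogNormal}(0.5\hat{\sigma}, 0.25)$ as a log-normal with log-mean $0.5\hat{\sigma}$ and log-standard-deviation $0.25$ is an assumption, as is the placeholder value for $\hat{\kappa}$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import genextreme, norm, lognorm

def neg_log_posterior(theta, y, mu_hat, sigma_hat, kappa_hat):
    """L_MAP: negative log-likelihood minus the log-priors on (mu, sigma, kappa).
    Optimizing over log(sigma) only reparameterizes the search; the objective
    itself is still evaluated at sigma."""
    mu, log_sigma, kappa = theta
    sigma = np.exp(log_sigma)
    nll = -np.sum(genextreme.logpdf(y, -kappa, loc=mu, scale=sigma))
    log_prior = (
        norm.logpdf(mu, loc=mu_hat, scale=sigma_hat)
        + lognorm.logpdf(sigma, 0.25, scale=np.exp(0.5 * sigma_hat))  # LogNormal(0.5*sigma_hat, 0.25)
        + norm.logpdf(kappa, loc=kappa_hat, scale=0.1)
    )
    return nll - log_prior

# Placeholder data; mu_hat and sigma_hat come from the empirical mean and
# standard deviation, kappa_hat is an assumed placeholder.
y = genextreme.rvs(-0.1, loc=10.0, scale=2.0, size=200, random_state=2)
mu_hat, sigma_hat, kappa_hat = np.mean(y), np.std(y), 0.0

theta0 = np.array([mu_hat, np.log(sigma_hat), kappa_hat])
map_fit = minimize(neg_log_posterior, theta0,
                   args=(y, mu_hat, sigma_hat, kappa_hat), method="BFGS")
mu_map, sigma_map, kappa_map = map_fit.x[0], np.exp(map_fit.x[1]), map_fit.x[2]
```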
Approximate Inference

Laplace Approximation (TODO)

SVI (TODO)

Sampling

MCMC (TODO)
Fletcher, R. (2000). Practical Methods of Optimization. Wiley. doi:10.1002/9781118723203