Gaussian Distributions#
Univariate Gaussian#
Multivariate Gaussian#
Joint Gaussian Distribution#
Lemma I - Conditional distribution of a Gaussian rv.#
Let’s define a joint Gaussian distribution for \(\mathbf{x,y}\).
We can write each of the marginal and conditional distributions just based on this joint distribution.
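For reference, one standard way to write this joint is in partitioned form (the block names \(a, b, A, B, C\) are an assumed labeling here, chosen to match the marginal and conditional expressions used later in these notes):

$$
\mathcal{P}(\mathbf{x,y}) = \mathcal{N}\left(
\begin{bmatrix} a \\ b \end{bmatrix},
\begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}
\right)
$$

so that the marginals are \(\mathcal{P}(\mathbf{x}) = \mathcal{N}(a, A)\) and \(\mathcal{P}(\mathbf{y}) = \mathcal{N}(b, C)\), and the conditional of \(\mathbf{x}\) given \(\mathbf{y}\) is

$$
\mathcal{P}(\mathbf{x|y}) = \mathcal{N}\left(a + BC^{-1}(\mathbf{y} - b),\; A - BC^{-1}B^\top\right).
$$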
Lemma II - Linear Conditional Gaussian model.#
Take a rv \(\mathbf{x}\) which is Gaussian distributed
and take a rv \(\mathbf{y}\) which is a linear transformation of \(\mathbf{x}\) and is also Gaussian distributed. So we have
Since both distributions are Gaussian, we can write down the joint distribution \(p(\mathbf{x,y})\), which is also Gaussian.
This is Gaussian distributed, so we can write down the same equations using the above lemma (the pieces are collected in block form after this list). Let:
\(\boldsymbol{\Sigma}_\mathbf{x}=\boldsymbol{\Sigma}_\mathbf{x}\)
\(\boldsymbol{\Sigma}_\mathbf{xy}=\boldsymbol{\Sigma}_\mathbf{x}\mathbf{A}^\top\)
\(\boldsymbol{\Sigma}_\mathbf{y}=\mathbf{A}\boldsymbol{\Sigma}_\mathbf{x}\mathbf{A}^\top + \mathbf{R}\)
\(\boldsymbol{\Sigma}_\mathbf{yx}=\mathbf{A}\boldsymbol{\Sigma}_\mathbf{x}\)
\(\boldsymbol{\mu}_\mathbf{y}=\mathbf{A}\boldsymbol{\mu}_\mathbf{x}+\mathbf{b}\)
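Collecting these quantities (and assuming, as is standard for this model, that \(\mathbf{y} = \mathbf{Ax} + \mathbf{b} + \boldsymbol{\varepsilon}\) with noise covariance \(\mathbf{R}\)), the joint can be written in block form as:

$$
p(\mathbf{x,y}) = \mathcal{N}\left(
\begin{bmatrix} \boldsymbol{\mu}_\mathbf{x} \\ \mathbf{A}\boldsymbol{\mu}_\mathbf{x} + \mathbf{b} \end{bmatrix},
\begin{bmatrix} \boldsymbol{\Sigma}_\mathbf{x} & \boldsymbol{\Sigma}_\mathbf{x}\mathbf{A}^\top \\ \mathbf{A}\boldsymbol{\Sigma}_\mathbf{x} & \mathbf{A}\boldsymbol{\Sigma}_\mathbf{x}\mathbf{A}^\top + \mathbf{R} \end{bmatrix}
\right)
$$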
Marginal#
From the lemma we have:
where:
\(a = \boldsymbol{\mu}_\mathbf{x}\)
\(\mathbf{A}=\boldsymbol{\Sigma}_\mathbf{x}\)
Fortunately, this is a simple plug-and-play result with no reductions needed.
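Written out under that mapping, the marginals are simply:

$$
p(\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\mathbf{x}, \boldsymbol{\Sigma}_\mathbf{x}), \qquad
p(\mathbf{y}) = \mathcal{N}(\mathbf{A}\boldsymbol{\mu}_\mathbf{x} + \mathbf{b},\; \mathbf{A}\boldsymbol{\Sigma}_\mathbf{x}\mathbf{A}^\top + \mathbf{R}).
$$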
Likelihood#
Take a Gaussian distribution with a full covariance matrix:
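For reference, the density with a full covariance \(\boldsymbol{\Sigma} \in \mathbb{R}^{D \times D}\) is:

$$
\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-D/2}\,|\boldsymbol{\Sigma}|^{-1/2}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
$$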
Mahalanobis Distance#
The Mahalanobis distance is given by:
We can write a simplified version in terms of the Euclidean norm.
Note: this quadratic form is the metric induced by the covariance that appears inside the Gaussian likelihood. With an identity covariance it reduces to the squared Euclidean distance, which is exactly the simplification that shows up in mean-squared-error loss functions.
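A minimal NumPy sketch (variable names are illustrative) showing the quadratic form and its collapse to the Euclidean norm when \(\Sigma = I\):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
x = rng.normal(size=d)
mu = np.zeros(d)

# a random symmetric positive-definite covariance
L = rng.normal(size=(d, d))
Sigma = L @ L.T + d * np.eye(d)

# Mahalanobis distance: sqrt((x - mu)^T Sigma^{-1} (x - mu))
diff = x - mu
d_mahalanobis = np.sqrt(diff @ np.linalg.solve(Sigma, diff))

# with Sigma = I this reduces to the plain Euclidean norm
d_euclidean = np.linalg.norm(diff)

print(d_mahalanobis, d_euclidean)
```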
Log likelihood#
We can also write the log-likelihood of the Gaussian distribution. We simply take the \(\log\) of the RHS.
If we assume that the samples of \(\mathbf{x}\) are iid, we can rewrite this as a summation.
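For reference, the standard forms are:

$$
\ln \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{D}{2}\ln(2\pi) - \frac{1}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})
$$

and, for \(N\) iid samples \(\{\mathbf{x}_n\}_{n=1}^N\),

$$
\ln p(\mathbf{X}|\boldsymbol{\mu},\boldsymbol{\Sigma}) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{n=1}^N(\mathbf{x}_n-\boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n-\boldsymbol{\mu}).
$$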
Trace-Trick#
We can rewrite the distance function using the trace-trick.
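Concretely, since a quadratic form is a scalar, we can wrap it in a trace and use the cyclic property \(\text{tr}(AB) = \text{tr}(BA)\):

$$
\sum_{n=1}^N (\mathbf{x}_n-\boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n-\boldsymbol{\mu})
= \sum_{n=1}^N \text{tr}\left(\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n-\boldsymbol{\mu})(\mathbf{x}_n-\boldsymbol{\mu})^\top\right)
= \text{tr}\left(\boldsymbol{\Sigma}^{-1}\mathbf{S}\right),
$$

where \(\mathbf{S} = \sum_{n=1}^N (\mathbf{x}_n-\boldsymbol{\mu})(\mathbf{x}_n-\boldsymbol{\mu})^\top\) is the scatter matrix.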
Optimization#
Positivity#
Softplus#
Source: Ensembles
var_scaled = softplus(var) + 10e-6  # softplus keeps the variance positive; the small constant adds numerical stability
Log variance#
var = exp(log_var) + 10e-6  # predict the log-variance and exponentiate to guarantee positivity
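A self-contained NumPy sketch of both parameterizations (function and variable names are illustrative, not from the source):

```python
import numpy as np

def softplus(x):
    # numerically stable softplus: log(1 + exp(x))
    return np.logaddexp(0.0, x)

raw = np.array([-3.0, 0.0, 2.5])   # unconstrained network output

# Option 1: softplus plus a small constant keeps the variance strictly positive
var_softplus = softplus(raw) + 1e-6

# Option 2: treat the output as the log-variance and exponentiate
log_var = raw
var_exp = np.exp(log_var)          # variance, always positive
std = np.exp(0.5 * log_var)        # standard deviation, if needed

print(var_softplus, var_exp, std)
```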
Log Likelihoods#
Marginal Distribution \(\mathcal{P}(\cdot)\)#
We have the marginal distribution of \(x\)
and in integral form:
\(\mathcal{P}(x) = \int_y \mathcal{P}(x,y)dy\)
and we have the marginal distribution of \(y\), in integral form \(\mathcal{P}(y) = \int_x \mathcal{P}(x,y)dx\)
Conditional Distribution \(\mathcal{P}(\cdot | \cdot)\)#
We have the conditional distribution of \(x\) given \(y\), \(\mathcal{P}(x|y) = \mathcal{N}(\mu_{a|b}, \Sigma_{a|b})\),
where:
\(\mu_{a|b} = a + BC^{-1}(y-b)\)
\(\Sigma_{a|b} = A - BC^{-1}B^T\)
and we have the conditional distribution of \(y\) given \(x\), \(\mathcal{P}(y|x) = \mathcal{N}(\mu_{b|a}, \Sigma_{b|a})\),
where:
\(\mu_{b|a} = b + B^\top A^{-1}(x-a)\)
\(\Sigma_{b|a} = C - B^\top A^{-1}B\)
The two cases are mirror images of each other. This will be useful later when we work out the marginal distributions of Gaussian process functions; a concrete numerical check is sketched below.
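A minimal NumPy sketch of the \(x|y\) case under this block convention (all names here are illustrative):

```python
import numpy as np

# joint over (x, y): mean (a, b), covariance [[A, B], [B.T, C]]
a, b = np.array([0.0]), np.array([1.0])
A = np.array([[2.0]])
B = np.array([[0.8]])
C = np.array([[1.5]])

y_obs = np.array([2.0])

# conditional x | y = y_obs
mu_x_given_y = a + B @ np.linalg.solve(C, y_obs - b)
Sigma_x_given_y = A - B @ np.linalg.solve(C, B.T)

print(mu_x_given_y, Sigma_x_given_y)
```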
Source:
Sampling from a Normal Distribution - blog
A really nice blog with nice plots of joint distributions.
Two ways to derive the conditional distributions - stack
How to generate Gaussian samples - blog
Multivariate Gaussians and Determinant - Lecture Notes
Bandwidth Selection#
Scott's rule
sigma = np.power(n_samples, -1.0 / (d_dimensions + 4))
Silverman's rule
sigma = np.power(n_samples * (d_dimensions + 2.0) / 4.0, -1.0 / (d_dimensions + 4))
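Putting both rules into a self-contained snippet (to my understanding, these are the same factors used by scipy.stats.gaussian_kde):

```python
import numpy as np

n_samples, d_dimensions = 1000, 2

# Scott's rule
sigma_scott = np.power(n_samples, -1.0 / (d_dimensions + 4))

# Silverman's rule
sigma_silverman = np.power(n_samples * (d_dimensions + 2.0) / 4.0,
                           -1.0 / (d_dimensions + 4))

print(sigma_scott, sigma_silverman)
```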
Gaussian Distribution#
PDF#
Likelihood#
Alternative Representation#
where \(\mu\) is the mean and \(\Sigma\) is the covariance. Let's decompose \(\Sigma\) with an eigendecomposition like so
Now we can represent our Normal distribution as:
where:
\(U\) is a rotation matrix
\(\Lambda^{-1/2}\) is a scale matrix
\(\mu\) is a translation vector
\(Z \sim \mathcal{N}(0,I)\)
or also
where:
\(U\) is a rotation matrix
\(\Lambda\) is a scale matrix
\(\mu\) is a translation vector
\(Z_n \sim \mathcal{N}(0,\Lambda)\)
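A minimal NumPy sketch of this representation, assuming the sampling form \(X = \mu + U\Lambda^{1/2}Z\) with \(Z \sim \mathcal{N}(0, I)\) (the inverse square root appears instead when whitening rather than sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 5000

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# eigendecomposition: Sigma = U Lambda U^T
lam, U = np.linalg.eigh(Sigma)

# rotate + scale standard-normal samples, then translate by mu
Z = rng.standard_normal(size=(n, d))
X = mu + (Z * np.sqrt(lam)) @ U.T   # each row is mu + U Lambda^{1/2} z

print(np.cov(X, rowvar=False))      # should be close to Sigma
```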
Reparameterization#
In deep learning, we often learn this distribution through a reparameterization like so:
where:
\(\mu \in \mathbb{R}^{d}\)
\(A \in \mathbb{R}^{d\times l}\)
\(Z_n \sim \mathcal{N}(0, I)\)
\(\Sigma=AA^\top\) - the Cholesky decomposition
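A minimal sketch of this reparameterization using a Cholesky factor (illustrative names; in practice \(\mu\) and \(A\) would be outputs of a network, so gradients flow through them while the randomness stays in \(z\)):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.5, -1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

A = np.linalg.cholesky(Sigma)   # lower-triangular factor with Sigma = A A^T

# reparameterized sample: x = mu + A z, with z ~ N(0, I)
z = rng.standard_normal(mu.shape[0])
x = mu + A @ z

print(x)
```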
Entropy#
1 dimensional: \(H(X) = \frac{1}{2}\ln\left(2\pi e \sigma^2\right)\)
D dimensional: \(H(X) = \frac{D}{2} + \frac{D}{2} \ln(2\pi) + \frac{1}{2}\ln|\Sigma|\)
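A quick numerical check of the D-dimensional formula against scipy's built-in entropy (using an arbitrary example covariance):

```python
import numpy as np
from scipy import stats

Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
D = Sigma.shape[0]

# closed form: D/2 + (D/2) ln(2 pi) + 0.5 ln|Sigma|
_, logdet = np.linalg.slogdet(Sigma)
H = 0.5 * D + 0.5 * D * np.log(2 * np.pi) + 0.5 * logdet

H_scipy = stats.multivariate_normal(mean=np.zeros(D), cov=Sigma).entropy()
print(H, H_scipy)   # should agree
```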
KL-Divergence (Relative Entropy)#
if \(\mu_1=\mu_0\) then:
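A minimal NumPy sketch of the general closed form \(KL(\mathcal{N}_0 \,\|\, \mathcal{N}_1)\) (a standard result; when \(\mu_1 = \mu_0\) the quadratic term vanishes, as in the special case above):

```python
import numpy as np

def gaussian_kl(mu0, Sigma0, mu1, Sigma1):
    """KL( N(mu0, Sigma0) || N(mu1, Sigma1) ) between multivariate Gaussians."""
    D = mu0.shape[0]
    Sigma1_inv = np.linalg.inv(Sigma1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(Sigma0)
    _, logdet1 = np.linalg.slogdet(Sigma1)
    return 0.5 * (np.trace(Sigma1_inv @ Sigma0)
                  + diff @ Sigma1_inv @ diff
                  - D
                  + logdet1 - logdet0)

mu0, Sigma0 = np.zeros(2), np.eye(2)
mu1, Sigma1 = np.ones(2), np.array([[2.0, 0.5], [0.5, 1.0]])
print(gaussian_kl(mu0, Sigma0, mu1, Sigma1))
```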
Mutual Information#
where \(\rho_0\) is the correlation matrix from \(\Sigma_0\).