Rotation-Based Iterative Gaussianization (RBIG)
- Motivation
- Algorithm
- Marginal (Univariate) Gaussianization
- Linear Transformation
- Information Theory Measures
- Information
- Entropy
- Mutual Information
- KL-Divergence
Motivation
The RBIG algorithm is a member of the density destructor family of methods. A density destructor is a generative model that seeks to transform your original data distribution \(\mathcal{X}\) into a base distribution \(\mathcal{Z}\) through an invertible transformation \(\mathcal{G}_\theta\), parameterized by \(\theta\).
Because we have an invertible transform, we can use the change-of-variables formula to get probability estimates in our original data space \(\mathcal{X}\) using our base distribution \(\mathcal{Z}\). This well-known formula is written as:

\[ p_\mathbf{x}(\mathbf{x}) = p_\mathbf{z}\left( \mathcal{G}_\theta(\mathbf{x}) \right) \left| \det \nabla_\mathbf{x} \mathcal{G}_\theta(\mathbf{x}) \right| \]
If you are familiar with normalizing flows, you'll find some similarities between the formulations; inherently, they are the same. However, most (if not all) major normalizing flow methods focus on log-likelihood estimation of the data \(\mathcal{X}\), maximizing a log-likelihood objective that includes the log-determinant of the Jacobian. RBIG is different in this regard, as it has a different objective: to maximize the negentropy or, equivalently, minimize the total correlation.
Essentially, RBIG is an algorithm that embodies the density destructor philosophy. By destroying the density, we maximize the entropy and remove all redundancies within the marginals of the variables in question. This formulation allows us to utilize RBIG to calculate many other IT measures which we highlight below.
Algorithm
Gaussianization - Given a random variable \(\mathbf{x} \in \mathbb{R}^d\), a Gaussianization transform is an invertible and differentiable transform \(\mathcal{G}(\mathbf{x})\) s.t. \(\mathcal{G}(\mathbf{x}) \sim \mathcal{N}(0, \mathbf{I})\).
Each RBIG iteration applies a marginal Gaussianization followed by a rotation:

\[ \mathbf{x}^{(k+1)} = \mathbf{R}_{(k)} \cdot \mathbf{\Psi}_{(k)}\left( \mathbf{x}^{(k)} \right) \]

where:
- \(\mathbf{\Psi}_{(k)}\) is the marginal Gaussianization of each dimension of \(\mathbf{x}^{(k)}\) for the corresponding iteration.
- \(\mathbf{R}_{(k)}\) is the rotation matrix for the marginally Gaussianized variable \(\mathbf{\Psi}_{(k)}\left( \mathbf{x}^{(k)} \right)\).
Marginal (Univariate) Gaussianization
This transformation is the \(\mathbf{\Psi}_\theta\) step for the RBIG algorithm.
In theory, to go from any distribution to a Gaussian distribution, we need to apply the following steps.
To go from \(\mathcal P \rightarrow \mathcal G\):
- Convert the data to a Uniform distribution \(\mathcal U\) using the empirical CDF.
- Apply the inverse Gaussian CDF to map from \(\mathcal U\) to \(\mathcal G\).
So to break this down even further, we need two key components:
Marginal Uniformization
We have to estimate the PDF of the marginal distribution of \(\mathbf x\). Then, using the CDF of that estimated distribution, we can compute the uniform variable \(u = F_{\mathbf x}(\mathbf x)\).
This boils down to estimating a histogram of the data in order to get some probability distribution. There are a few ways to do this, but the simplest is the histogram function. We can then convert it to a scipy stats random variable, which gives us access to functions like pdf and cdf. One nice trick is to extend the support slightly beyond the observed data range so that the transformation is smooth and all samples fall within the boundaries.
- Example Implementation - ddl/univariate.py
From there, we just need the CDF of that estimated distribution to obtain the uniform variable \(u(\mathbf x)\); scipy also exposes the ppf function (the inverse CDF) for inverting the transformation.
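As a minimal sketch of this uniformization step, we can combine numpy's histogram with scipy's `rv_histogram` wrapper. The `support_extension` parameter below is my own name for the boundary trick mentioned above, not part of any reference implementation.

```python
import numpy as np
from scipy import stats

def marginal_uniformization(x, bins=50, support_extension=0.1):
    # Extend the histogram support slightly beyond the observed range so
    # the transform is smooth and samples stay within the boundaries.
    spread = support_extension * (x.max() - x.min())
    hist = np.histogram(x, bins=bins, range=(x.min() - spread, x.max() + spread))
    # rv_histogram wraps the histogram as a distribution with pdf/cdf/ppf.
    dist = stats.rv_histogram(hist)
    return dist.cdf(x), dist

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, size=5000)  # a skewed, non-Gaussian sample
u, dist = marginal_uniformization(x)
```

The returned `dist` object keeps the fitted CDF around, so the same transformation can later be inverted with `dist.ppf`.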
Gaussianization of a Uniform Variable
Once we have a uniform variable \(u\), we apply the inverse CDF of the Gaussian distribution to obtain a Gaussian variable:

\[ g = \Phi^{-1}(u) \]
where \(\Phi^{-1}\) is the inverse CDF (quantile function) of the standard normal distribution \(\mathcal{N}(0,1)\).
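This step is a one-liner with scipy's `norm.ppf`. A small clipping constant (`eps`, an implementation choice of this sketch) keeps the quantile function finite at the boundaries:

```python
import numpy as np
from scipy import stats

def uniform_to_gaussian(u, eps=1e-7):
    # Clip away from 0 and 1 so the quantile function stays finite.
    u = np.clip(u, eps, 1.0 - eps)
    return stats.norm.ppf(u)

rng = np.random.default_rng(0)
u = rng.uniform(size=5000)
g = uniform_to_gaussian(u)  # approximately N(0, 1) samples
```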
Linear Transformation
This is the \(\mathcal R_\theta\) step in the RBIG algorithm. We take some data \(\mathbf x_i\) and apply a rotation to that data, \(\mathcal R_\theta (\mathbf x_i)\). This rotation is somewhat flexible, provided the matrix satisfies a few criteria:
- Orthogonal (columns are mutually perpendicular)
- Orthonormal (columns are also unit-norm, so \(\mathbf R^{-1} = \mathbf R^\top\))
- Invertible (which follows directly from orthonormality)
So a few options that have been implemented include:
- Independent Components Analysis (ICA)
- Principal Components Analysis (PCA)
- Random Rotations (random)
We would like to extend this framework to include more options, e.g.
- Convolutions (conv)
- Orthogonally Initialized Components (dct)
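The PCA and random-rotation options can be sketched with plain numpy: PCA takes the eigenvectors of the sample covariance, and a random rotation comes from the QR decomposition of a Gaussian matrix. These are illustrative stand-ins, not the repository's implementations:

```python
import numpy as np

def pca_rotation(X):
    # Eigenvectors of the sample covariance form an orthonormal basis.
    _, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return eigvecs

def random_rotation(n_features, rng):
    # QR decomposition of a Gaussian matrix yields an orthonormal factor;
    # fixing the column signs makes the factorization unique.
    Q, R = np.linalg.qr(rng.normal(size=(n_features, n_features)))
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))  # correlated data
R_pca = pca_rotation(X)
R_rand = random_rotation(3, rng)
```

Both matrices satisfy \(\mathbf R^\top \mathbf R = \mathbf I\), and the PCA rotation additionally decorrelates the sample.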
The whole transformation process goes as follows:

\[ \mathcal P \rightarrow \mathbf W \cdot \mathcal P \rightarrow \mathcal U \rightarrow \mathcal G \]

Where we have the following spaces:
- \(\mathcal P\) - the data space for \(\mathcal X\).
- \(\mathbf W \cdot \mathcal P\) - The transformed space.
- \(\mathcal U\) - The Uniform space.
- \(\mathcal G\) - The Gaussian space.
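Putting the pieces together, one RBIG pass alternates marginal Gaussianization with a rotation. The sketch below uses histogram CDFs and PCA rotations; the bin count, clipping constant, and iteration count are assumptions of this example rather than prescribed values:

```python
import numpy as np
from scipy import stats

def marginal_gaussianization(X, bins=50, eps=1e-7):
    # Per dimension: histogram CDF -> Uniform(0, 1) -> Gaussian quantile.
    G = np.empty_like(X, dtype=float)
    for d in range(X.shape[1]):
        dist = stats.rv_histogram(np.histogram(X[:, d], bins=bins))
        u = np.clip(dist.cdf(X[:, d]), eps, 1.0 - eps)
        G[:, d] = stats.norm.ppf(u)
    return G

def rbig(X, n_iterations=30):
    # Alternate marginal Gaussianization with a PCA rotation.
    for _ in range(n_iterations):
        X = marginal_gaussianization(X)
        _, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
        X = X @ eigvecs
    return X

rng = np.random.default_rng(0)
z = rng.normal(size=(2000, 2))
# Two dependent, non-Gaussian dimensions.
X = np.column_stack([z[:, 0] ** 3, z[:, 0] + 0.5 * z[:, 1]])
G = rbig(X)
```

After enough iterations the output should be close to \(\mathcal{N}(0, \mathbf{I})\): unit-variance marginals with negligible correlation between dimensions.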
Information Theory Measures
Information
See Information Theory Measures for details.
Entropy
A multivariate Gaussian is the maximum-entropy distribution for a fixed covariance, so RBIG does not reduce the entropy of the data; rather, each iteration removes redundancy (total correlation) between the dimensions. The entropy of the original data can then be estimated through the identity \(H(\mathbf x) = \sum_d H(x_d) - T(\mathbf x)\), where the total correlation \(T(\mathbf x)\) is obtained by accumulating the marginal non-Gaussianity removed across all iterations. See Information Theory Measures for the general definition.
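The per-iteration quantities reduce to one-dimensional entropy estimates. A rough histogram-based marginal entropy estimator of the kind such computations can build on (bin count and the \(\log \Delta\) correction are standard choices, not prescribed by the text):

```python
import numpy as np

def marginal_entropy(x, bins=50):
    # Plug-in estimate of differential entropy (in nats): discrete entropy
    # of the bin probabilities plus the log of the bin width.
    counts, edges = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                      # skip empty bins (0 * log 0 = 0)
    width = edges[1] - edges[0]
    return -np.sum(p * np.log(p)) + np.log(width)

rng = np.random.default_rng(0)
x = rng.normal(size=10000)
h = marginal_entropy(x)
# A standard Gaussian has differential entropy 0.5 * log(2*pi*e) ~ 1.4189
```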
Mutual Information
The mutual information between two multivariate variables \(\mathbf X\) and \(\mathbf Y\) can be estimated with RBIG by Gaussianizing each variable separately and then measuring the total correlation that remains between the two transformed variables.
KL-Divergence
Let \(\mathcal{G}_\theta (\mathbf{X})\) be the Gaussianization of the variable \(\mathbf{X}\), parameterized by \(\theta\). Because the KL-divergence is invariant under invertible transformations, the divergence between \(\mathbf{X}\) and another variable \(\mathbf{Y}\) can be estimated as the divergence of \(\mathcal{G}_\theta (\mathbf{Y})\) from the standard Gaussian \(\mathcal{N}(0, \mathbf{I})\).