Normalizing Flows

Main Idea

Distribution flows through a sequence of invertible transformations - Rezende & Mohamed (2015)

We want to fit a density model \(p_\theta(x)\) with continuous data \(x \in \mathbb{R}^N\). Ideally, we want this model to:

  • Modeling: Find the underlying distribution for the training data.
  • Probability: For a new \(x' \sim \mathcal{X}\), we want to be able to evaluate \(p_\theta(x')\)
  • Sampling: We also want to be able to generate new samples \(x \sim p_\theta(x)\).
  • Latent Representation: Ideally we want this representation to be meaningful.

Let's assume that a probability distribution for \(\mathcal{X}\) exists but is very difficult to model directly. So, instead of parameterizing \(p_\theta(x)\) directly, we learn a parameterized function \(f_\theta(x)\) that maps the data to a latent variable \(z\).

\[z = f_\theta(x)\]

We want \(z\) to have certain properties:

  1. We want \(z\) to be defined by a probabilistic function and have a valid distribution \(z \sim p_\mathcal{Z}(z)\).
  2. We also prefer this distribution to be simple. We typically pick a standard normal distribution, \(z \sim \mathcal{N}(0, I)\).

We begin with an initial distribution and then apply a sequence of \(L\) invertible transformations to obtain something more expressive. This originally came from the context of Variational AutoEncoders (VAE) where the posterior was approximated by a neural network, and the authors wanted to enrich the approximate posterior beyond a simple Gaussian.

\[ \begin{aligned} \mathbf{z}_L = f_L \circ f_{L-1} \circ \ldots \circ f_2 \circ f_1 (\mathbf{z}_0) \end{aligned} \]

Loss Function

We can fit our distribution \(p_\theta(x)\) by simple maximum likelihood.

\[\underset{\theta}{\text{max}} \sum_i \log p_\theta(x^{(i)})\]

However, this expression needs to be rewritten in terms of the invertible function \(f_\theta(x)\). This is where we exploit the change of variables rule: the density of \(x\) is the base density evaluated at \(z = f_\theta(x)\), multiplied by the absolute value of the determinant of the Jacobian of \(f_\theta\).

\[ \begin{aligned} p_\theta(x) = p_\mathcal{Z}(f_\theta(x)) \left| \det \frac{\partial f_\theta(x)}{\partial x} \right| \end{aligned} \]

So now, we can write the same maximization objective with our change of variables formulation:

\[ \begin{aligned} \underset{\theta}{\text{max}} \sum_i \log p_\theta(x^{(i)}) &= \underset{\theta}{\text{max}} \sum_i \left[ \log p_\mathcal{Z}\left(f_\theta(x^{(i)})\right) + \log \left| \det \frac{\partial f_\theta (x^{(i)})}{\partial x} \right| \right] \end{aligned} \]

And we can optimize this using stochastic gradient descent (SGD) on the negative log-likelihood, which means we can use all of the automatic differentiation and deep learning libraries available to make this procedure relatively painless.
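As a minimal sketch of this objective, consider the simplest possible flow: an elementwise affine map \(z = (x - \mu) / \sigma\) with a standard normal base distribution. The function names and the parameters `mu` and `log_sigma` are illustrative assumptions, not part of any particular library. The Jacobian here is diagonal, so its log-determinant is just \(-\sum_n \log \sigma_n\).

```python
import numpy as np

def affine_forward(x, mu, log_sigma):
    """Map data x to latent z = (x - mu) / sigma and return log|det dz/dx| per sample."""
    z = (x - mu) * np.exp(-log_sigma)
    # The Jacobian is diagonal with entries 1/sigma, so log|det| = -sum(log_sigma).
    log_det = -np.sum(log_sigma) * np.ones(x.shape[0])
    return z, log_det

def standard_normal_logpdf(z):
    """Log-density of a fully factorized N(0, I), summed over dimensions."""
    return -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi), axis=1)

def log_likelihood(x, mu, log_sigma):
    """Change of variables: log p(x) = log p_Z(f(x)) + log|det df/dx|."""
    z, log_det = affine_forward(x, mu, log_sigma)
    return standard_normal_logpdf(z) + log_det

# Toy data and (hypothetical) parameter values for illustration.
rng = np.random.default_rng(0)
x = rng.normal(loc=[1.0, -2.0], scale=[0.5, 2.0], size=(1000, 2))
mu = np.array([1.0, -2.0])
log_sigma = np.log(np.array([0.5, 2.0]))

# Negative log-likelihood: the quantity one would minimize with SGD.
nll = -np.mean(log_likelihood(x, mu, log_sigma))
print(nll)
```

In practice `mu` and `log_sigma` would be learned, and \(f_\theta\) would be a much richer transformation, but the two terms of the objective keep the same form.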

Sampling

If we want to sample from our model \(p_\theta(x)\), then we just need to draw \(z\) from the base distribution and apply the inverse of our function.

\[x = f_\theta^{-1}(z)\]

where \(z \sim p_\mathcal{Z}(z)\). Remember, our \(f_\theta(\cdot)\) is invertible and differentiable so this should be no problem.
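A small sketch of this sampling procedure, again assuming an elementwise affine flow \(z = (x - \mu) / \sigma\) whose inverse is \(x = \mu + \sigma z\). The names `affine_inverse` and `sample`, and the parameter values, are hypothetical and chosen only for illustration.

```python
import numpy as np

def affine_inverse(z, mu, log_sigma):
    """Invert z = (x - mu) / sigma, i.e. x = mu + sigma * z."""
    return mu + np.exp(log_sigma) * z

def sample(n, mu, log_sigma, rng):
    """Draw z from the base distribution N(0, I), then push it through f^{-1}."""
    z = rng.standard_normal(size=(n, mu.shape[0]))
    return affine_inverse(z, mu, log_sigma)

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
log_sigma = np.log(np.array([0.5, 2.0]))

samples = sample(5000, mu, log_sigma, rng)
print(samples.mean(axis=0), samples.std(axis=0))
```

The empirical mean and standard deviation of the samples should land close to `mu` and `exp(log_sigma)`, which is a quick sanity check that the inverse matches the forward map.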


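More generally, for a single invertible transformation \(z' = f(z)\) applied to a variable with density \(q(z)\), the change of variables rule gives the density of the transformed variable:

\[ \begin{aligned} q(z') = q(z) \left| \det \frac{\partial f}{\partial z} \right|^{-1} \end{aligned} \]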

or the same but only in terms of the original distribution \(\mathcal{X}\)

We can make this transformation easier to handle numerically by taking the logarithm of this expression. Applied to a composition of \(L\) transformations with \(\mathbf{z}_l = f_l(\mathbf{z}_{l-1})\), the product of inverse determinants becomes a sum of log-determinants, one per transformation, which is cheaper and more stable to compute.

\[ \begin{aligned} \log q_L (\mathbf{z}_L) = \log q_0 (\mathbf{z}_0) - \sum_{l=1}^L \log \left| \det \frac{\partial f_l}{\partial \mathbf{z}_{l-1}} \right| \end{aligned} \]
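To make the accumulation of log-determinants concrete, here is a sketch that chains two planar flows, \(f(\mathbf{z}) = \mathbf{z} + \mathbf{u} \tanh(\mathbf{w}^\top \mathbf{z} + b)\), the transformation used by Rezende & Mohamed (2015). The helper names and the specific parameter values are assumptions made for the example; the parameters are chosen so that \(\mathbf{w}^\top \mathbf{u} > 0\), which keeps each layer invertible.

```python
import numpy as np

def planar_flow(z, u, w, b):
    """Apply f(z) = z + u * tanh(w.z + b); return f(z) and log|det df/dz| per row."""
    a = z @ w + b                          # pre-activation, shape (n,)
    f_z = z + np.outer(np.tanh(a), u)      # transformed points, shape (n, d)
    # Matrix determinant lemma: det(I + u psi^T) = 1 + psi.u,
    # where psi = (1 - tanh^2(a)) * w.
    log_det = np.log(np.abs(1.0 + (1.0 - np.tanh(a) ** 2) * (w @ u)))
    return f_z, log_det

def flow_log_density(z0, log_q0, layers):
    """Push z0 through every layer, subtracting each log|det| from log q0."""
    z, log_q = z0, log_q0
    for u, w, b in layers:
        z, log_det = planar_flow(z, u, w, b)
        log_q = log_q - log_det
    return z, log_q

rng = np.random.default_rng(0)
z0 = rng.standard_normal(size=(4, 2))
log_q0 = -0.5 * np.sum(z0 ** 2 + np.log(2.0 * np.pi), axis=1)

# Two hypothetical layers (u, w, b) with w.u > 0.
layers = [
    (np.array([0.5, 0.3]), np.array([1.0, -0.5]), 0.1),
    (np.array([-0.2, 0.8]), np.array([0.3, 1.0]), -0.4),
]

z_L, log_q = flow_log_density(z0, log_q0, layers)
print(log_q)
```

Each layer only needs a dot product to get its log-determinant, so the cost of tracking the density grows linearly with the number of layers.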

So now, our original expression with \(p_\theta(x)\) can be written in terms of \(z\).

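In order to train this in the variational setting, we maximize a lower bound on the log-evidence, which requires taking expectations over the base distribution \(q_0\). Because \(-\log q_L(\mathbf{z}_L) = -\log q_0(\mathbf{z}_0) + \sum_l \log \left| \det \partial f_l / \partial \mathbf{z}_{l-1} \right|\), the log-determinant term enters with a positive sign.

\[ \begin{aligned} \mathcal{L}(\theta) &= \mathbb{E}_{q_0(\mathbf{z}_0)} \left[ \log p(\mathbf{x}, \mathbf{z}_L)\right] - \mathbb{E}_{q_0(\mathbf{z}_0)} \left[ \log q_0(\mathbf{z}_0) \right] + \mathbb{E}_{q_0(\mathbf{z}_0)} \left[ \sum_{l=1}^L \log \left| \det \frac{\partial f_l}{\partial \mathbf{z}_{l-1}} \right| \right] \end{aligned} \]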
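One way to see where each term comes from, assuming \(\mathcal{L}(\theta)\) denotes the evidence lower bound (to be maximized) and \(\mathbf{z}_L = f_L \circ \ldots \circ f_1(\mathbf{z}_0)\): start from the bound written with the final density \(q_L\), rewrite the expectation over \(\mathbf{z}_L\) as an expectation over \(\mathbf{z}_0\), and substitute \(\log q_L(\mathbf{z}_L) = \log q_0(\mathbf{z}_0) - \sum_l \log \left| \det \partial f_l / \partial \mathbf{z}_{l-1} \right|\).

\[ \begin{aligned} \mathcal{L}(\theta) &= \mathbb{E}_{q_L(\mathbf{z}_L)} \left[ \log p(\mathbf{x}, \mathbf{z}_L) - \log q_L(\mathbf{z}_L) \right] \\ &= \mathbb{E}_{q_0(\mathbf{z}_0)} \left[ \log p(\mathbf{x}, \mathbf{z}_L) - \log q_0(\mathbf{z}_0) + \sum_{l=1}^L \log \left| \det \frac{\partial f_l}{\partial \mathbf{z}_{l-1}} \right| \right] \end{aligned} \]

The expectation is taken over \(q_0\) only, so it can be estimated by drawing \(\mathbf{z}_0\) from the simple base distribution and pushing the samples through the flow.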

Choice of Transformations

Most research on normalizing flows concerns how to choose two aspects of the model: the prior distribution and the structure of the Jacobian.

Prior Distribution

This is very consistent across the literature: most people use a fully-factorized Gaussian distribution. Very simple.

Jacobian

This is where most of the research effort goes. There are many elaborate frameworks, but almost all of them can be categorized by how the Jacobian is constructed.

Resources

Best Tutorials


Survey of Literature

Neural Density Estimators

Neural density estimators use neural networks to directly parameterize the transformations in a normalizing flow. Key approaches include Masked Autoregressive Flows (MAF) and Inverse Autoregressive Flows (IAF), which exploit autoregressive structure for efficient density evaluation or sampling, respectively.
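The trade-off between density evaluation and sampling can be sketched with a toy affine autoregressive transform in the style of MAF. This is only an illustration: the "conditioner" below is a hypothetical masked linear map rather than a neural network, and all names are assumptions. Because output \(i\) depends only on inputs with index less than \(i\), the Jacobian is triangular and its log-determinant is a simple sum.

```python
import numpy as np

def make_conditioner(dim, rng):
    """Hypothetical linear conditioner: strictly lower-triangular weights, so
    output i depends only on inputs with index < i."""
    mask = np.tril(np.ones((dim, dim)), k=-1)
    return mask * rng.normal(size=(dim, dim)), mask * rng.normal(size=(dim, dim))

def forward(x, W_mu, W_alpha):
    """Density-evaluation direction x -> z, computed in a single pass."""
    mu = x @ W_mu.T
    alpha = np.tanh(x @ W_alpha.T)       # bounded log-scales
    z = (x - mu) * np.exp(-alpha)
    # Triangular Jacobian with diagonal exp(-alpha_i): log|det| = -sum(alpha).
    log_det = -np.sum(alpha, axis=1)
    return z, log_det

def inverse(z, W_mu, W_alpha):
    """Sampling direction z -> x, computed one dimension at a time."""
    x = np.zeros_like(z)
    for i in range(z.shape[1]):
        # Row i of each weight matrix is zero at indices >= i, so only the
        # already-computed entries x[:, :i] contribute here.
        mu_i = x @ W_mu[i]
        alpha_i = np.tanh(x @ W_alpha[i])
        x[:, i] = z[:, i] * np.exp(alpha_i) + mu_i
    return x

rng = np.random.default_rng(0)
W_mu, W_alpha = make_conditioner(4, rng)
x = rng.normal(size=(3, 4))

z, log_det = forward(x, W_mu, W_alpha)
x_rec = inverse(z, W_mu, W_alpha)
print(np.max(np.abs(x - x_rec)))
```

The forward pass is fully parallel across dimensions while the inverse needs a sequential loop. An IAF-style flow makes the opposite choice, so sampling is the parallel direction and density evaluation of arbitrary points is the sequential one.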

Deep Density Destructors

Density destructors take a complementary view: instead of transforming a simple distribution into a complex one, they iteratively transform a complex distribution toward uniformity. See the Deep Density Destructors notes for details.


Code Tutorials

  • Building Prob Dist with TF Probability Bijector API - Blog
  • Sculpting Distributions with Normalizing Flows - Blog

Tutorials

RBIG Upgrades

Cutting Edge

Github Implementations