
SurVAE flows, Gaussianization, and likelihood accounting

This note gives a deliberately slow proof of the likelihood rules behind SurVAE flows from the perspective of Gaussianization. The goal is to explain why ordinary normalizing flows, surjective transformations, and stochastic VAE-like transformations can all be treated as composable layers with local likelihood contributions.

The main references are SurVAE Flows (Nielsen et al., 2020), Gaussianization Flows (Meng et al., 2020), iterative Gaussianization (Laparra et al., 2011), and the standard normalizing flow / VAE literature (Rezende & Mohamed, 2015; Dinh et al., 2017; Kingma & Welling, 2014).

1. Gaussianization as density estimation

Let

$$x \in \mathcal X, \qquad z \in \mathcal Z, \qquad p_Z(z)=\mathcal N(z;0,I).$$

In Gaussianization, we learn a map

$$T:\mathcal X\to\mathcal Z, \qquad z=T(x),$$

so that the transformed data look approximately standard Gaussian:

$$z=T(x)\sim \mathcal N(0,I).$$

If $T$ is bijective and differentiable, the likelihood follows from the ordinary change-of-variables formula:

$$p_X(x)=p_Z(T(x))\left|\det J_T(x)\right|.$$

Equivalently,

$$\log p_X(x)=\log p_Z(T(x)) + \log\left|\det J_T(x)\right|.$$

For a composition of bijective Gaussianization layers,

$$x=x_0 \mapsto x_1 \mapsto \cdots \mapsto x_K=z,$$

we get

$$\log p_X(x) = \log p_Z(x_K) + \sum_{k=1}^K \log\left|\det J_{T_k}(x_{k-1})\right|.$$

This is the classical normalizing-flow story.
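To make the accounting concrete, here is a minimal sketch of the composed formula, using hypothetical elementwise affine layers and a standard Gaussian base; the layer form, parameter values, and function names are illustrative assumptions, not from the cited papers.

```python
# Sketch of log p_X(x) = log p_Z(x_K) + sum_k log|det J_{T_k}(x_{k-1})|,
# using hypothetical elementwise affine layers z = a * x + b (illustrative only).
import numpy as np
from scipy.stats import norm

def affine_layer(x, a, b):
    """Elementwise bijection x -> a*x + b with log|det J| = sum(log|a|)."""
    return a * x + b, np.sum(np.log(np.abs(a)))

def flow_log_likelihood(x, layers):
    """Push x through the layers and accumulate the log-det corrections."""
    total_log_det = 0.0
    for a, b in layers:
        x, log_det = affine_layer(x, a, b)
        total_log_det += log_det
    return np.sum(norm.logpdf(x)) + total_log_det   # log p_Z(z) + corrections

x = np.array([0.3, -1.2])
layers = [(np.array([2.0, 0.5]), np.array([0.1, -0.3])),
          (np.array([1.5, 1.5]), np.array([0.0, 0.2]))]
print(flow_log_likelihood(x, layers))
```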

SurVAE flows ask: what if some useful transformations are not bijections?

Examples include sorting, absolute value, max pooling, slicing, augmentation, dequantization, periodic wrapping, and VAE-style stochastic maps. These are natural in image modeling, representation learning, and geoscience, where many forward operators lose information.

2. The universal latent-variable identity

Start from the marginal likelihood identity

$$p_X(x)=\int p_{X,Z}(x,z)\,dz.$$

Factor the joint distribution generatively:

$$p_{X,Z}(x,z)=p_Z(z)p_{X\mid Z}(x\mid z).$$

Then

$$p_X(x)=\int p_Z(z)p_{X\mid Z}(x\mid z)\,dz.$$

Now introduce any auxiliary inverse or inference density

$$q_{Z\mid X}(z\mid x),$$

assuming it is positive wherever the integrand is positive. Then

$$p_X(x) = \int q_{Z\mid X}(z\mid x) \frac{p_Z(z)p_{X\mid Z}(x\mid z)}{q_{Z\mid X}(z\mid x)}\,dz.$$

Taking logs gives

$$\log p_X(x) = \log \mathbb E_{q(z\mid x)} \left[ \frac{p_Z(z)p_{X\mid Z}(x\mid z)}{q_{Z\mid X}(z\mid x)} \right].$$

By Jensen’s inequality,

$$\log p_X(x) \ge \mathbb E_{q(z\mid x)} \left[ \log p_Z(z) + \log p_{X\mid Z}(x\mid z) - \log q_{Z\mid X}(z\mid x) \right].$$

A VAE uses this lower bound directly. A bijective normalizing flow is a special case where the bound is exact because the inverse is deterministic and unique. SurVAE flows organize many transformation types under this same accounting system.
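As a sanity check on this algebra, the following sketch evaluates the Jensen bound and an importance-weighted estimate in a one-dimensional toy model where the exact marginal is known in closed form. The model choices ($p_Z=\mathcal N(0,1)$, $p(x\mid z)=\mathcal N(x;z,\sigma^2)$, and a deliberately mismatched Gaussian $q(z\mid x)$) are assumptions for illustration only.

```python
# Toy check: p_Z = N(0,1), p(x|z) = N(x; z, sigma^2), so p_X = N(0, 1 + sigma^2).
# A deliberately mismatched q(z|x) shows the gap between the ELBO and log p_X.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, x = 0.5, 1.3

mu_q, s_q = x, 1.0                      # crude inference density (an assumption)
z = rng.normal(mu_q, s_q, size=200_000)
log_w = (norm.logpdf(z) + norm.logpdf(x, loc=z, scale=sigma)
         - norm.logpdf(z, loc=mu_q, scale=s_q))

elbo = log_w.mean()                                  # Jensen lower bound
log_px_is = np.log(np.mean(np.exp(log_w)))           # importance-sampling estimate
log_px_exact = norm.logpdf(x, scale=np.sqrt(1 + sigma**2))
print(elbo, log_px_is, log_px_exact)                 # elbo <= log_px_exact
```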

3. Bijective transformations

Assume

$$z=T(x), \qquad x=T^{-1}(z),$$

where $T:\mathbb R^D\to\mathbb R^D$ is a differentiable bijection.

3.1 Volume-element proof

For a small region $A\subset \mathcal X$,

$$\mathbb P(x\in A)=\mathbb P(z\in T(A)).$$

Locally,

$$dz=\left|\det J_T(x)\right|dx.$$

Therefore,

$$p_X(x)dx=p_Z(z)dz.$$

Substitute $z=T(x)$:

$$p_X(x)dx=p_Z(T(x))\left|\det J_T(x)\right|dx.$$

Cancel $dx$:

$$p_X(x)=p_Z(T(x))\left|\det J_T(x)\right|.$$

Thus,

$$\log p_X(x)=\log p_Z(T(x))+\log\left|\det J_T(x)\right|.$$
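A quick numerical check of this formula, using the monotone bijection $T(x)=\sinh(x)$ as an illustrative choice: samples of $x=T^{-1}(z)$ with $z\sim\mathcal N(0,1)$ should have an empirical density matching $p_Z(T(x))\,|T'(x)|$.

```python
# Numerical check of p_X(x) = p_Z(T(x)) |T'(x)| for T(x) = sinh(x) (illustrative).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
z = rng.standard_normal(2_000_000)
x = np.arcsinh(z)                                    # x = T^{-1}(z), so x ~ p_X

x0, h = 0.8, 0.01
empirical = np.mean(np.abs(x - x0) < h) / (2 * h)    # histogram-style estimate
formula = norm.pdf(np.sinh(x0)) * np.cosh(x0)        # p_Z(T(x0)) |T'(x0)|
print(empirical, formula)                            # should agree closely
```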

3.2 Dirac-delta proof

Now write the generative direction as

$$z\sim p_Z(z), \qquad x=f(z),$$

where $f=T^{-1}$. Since $x$ is deterministic given $z$, the conditional density is a Dirac delta:

$$p_{X\mid Z}(x\mid z)=\delta(x-f(z)).$$

Hence

$$p_X(x)=\int p_Z(z)\delta(x-f(z))\,dz.$$

Because $f$ is bijective, the equation

$$x=f(z)$$

has exactly one solution

$$z=f^{-1}(x)=T(x).$$

The multivariate delta identity gives

$$\delta(x-f(z)) = \frac{\delta(z-f^{-1}(x))}{\left|\det J_f(f^{-1}(x))\right|}.$$

Therefore,

$$p_X(x) = \int p_Z(z) \frac{\delta(z-f^{-1}(x))}{\left|\det J_f(f^{-1}(x))\right|}\,dz.$$

The denominator is constant with respect to $z$, so

$$p_X(x) = \frac{1}{\left|\det J_f(f^{-1}(x))\right|} \int p_Z(z)\delta(z-f^{-1}(x))\,dz.$$

Using the sifting property of the delta function,

$$\int p_Z(z)\delta(z-f^{-1}(x))\,dz = p_Z(f^{-1}(x)).$$

Thus

$$p_X(x) = \frac{p_Z(f^{-1}(x))}{\left|\det J_f(f^{-1}(x))\right|}.$$

Since $T=f^{-1}$,

$$\left|\det J_T(x)\right| = \frac{1}{\left|\det J_f(f^{-1}(x))\right|},$$

so

$$p_X(x)=p_Z(T(x))\left|\det J_T(x)\right|.$$
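The reciprocal relationship between the two Jacobians can be checked numerically for the same illustrative scalar pair $T(x)=\sinh(x)$, $f(z)=\operatorname{arcsinh}(z)$ used above.

```python
# Check |det J_T(x)| = 1 / |det J_f(f^{-1}(x))| for T(x) = sinh(x), f(z) = arcsinh(z).
import numpy as np

x = 0.8
dT = np.cosh(x)                            # T'(x)
df = 1.0 / np.sqrt(1.0 + np.sinh(x)**2)    # f'(z) evaluated at z = T(x) = sinh(x)
print(dT, 1.0 / df)                        # equal up to floating-point error
```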

4. Bijections as degenerate VAEs

A bijective flow can be written as a latent-variable model with deterministic encoder and decoder:

$$q_{Z\mid X}(z\mid x)=\delta(z-T(x)),$$

and

$$p_{X\mid Z}(x\mid z)=\delta(x-T^{-1}(z)).$$

There is no posterior uncertainty because each $x$ corresponds to exactly one $z$. Therefore, the variational lower bound is tight. This is why normalizing flows give exact likelihoods.

5. Surjective transformations

A map

$$f:\mathcal Z\to\mathcal X$$

is surjective if every $x\in\mathcal X$ has at least one preimage, but possibly many:

$$f^{-1}(x)=\{z:f(z)=x\}.$$

Generatively,

$$z\sim p_Z(z), \qquad x=f(z).$$

The forward map is deterministic, but the inverse is ambiguous.

Examples:

$$x=|z|,$$

where $z=x$ and $z=-x$ both map to the same value;

$$x=\operatorname{sort}(z),$$

where all permutations of $z$ map to the same sorted vector; and

$$x=\operatorname{slice}(z),$$

where some coordinates are discarded.

6. Exact likelihood for finite-to-one surjections

Assume $f:\mathbb R^D\to\mathbb R^D$ is many-to-one but locally invertible on branches. Let the domain decompose into branches

$$\mathcal Z=\bigcup_k \mathcal Z_k,$$

and let

$$f_k:\mathcal Z_k\to\mathcal X$$

be bijective on each branch. For a given $x$, define

$$z_k=f_k^{-1}(x).$$

Start again from the delta representation:

$$p_X(x)=\int p_Z(z)\delta(x-f(z))\,dz.$$

Split the integral over branches:

$$p_X(x)= \sum_k \int_{\mathcal Z_k}p_Z(z)\delta(x-f_k(z))\,dz.$$

On each branch,

$$\delta(x-f_k(z)) = \frac{\delta(z-z_k)}{\left|\det J_{f_k}(z_k)\right|}.$$

Therefore,

$$p_X(x)= \sum_k \frac{p_Z(z_k)}{\left|\det J_{f_k}(z_k)\right|}.$$

Equivalently,

$$p_X(x)= \sum_{z\in f^{-1}(x)} p_Z(z) \left|\det J_{f^{-1}_{\text{branch}}}(x)\right|.$$

This is exact, but the sum may be expensive. Sorting has up to $D!$ branches, for example.
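As a concrete instance of the branch sum, here is a sketch for the two-to-one map $x=z^2$ with $z\sim\mathcal N(0,1)$ (an illustrative choice): the two branches are $z=\pm\sqrt x$, each with forward Jacobian $|f'(z)|=2|z|$.

```python
# Branch-sum check for x = z^2, z ~ N(0,1):
# p_X(x) = [p_Z(sqrt(x)) + p_Z(-sqrt(x))] / (2 sqrt(x)).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x_samples = rng.standard_normal(2_000_000) ** 2

x0, h = 0.5, 0.005
empirical = np.mean(np.abs(x_samples - x0) < h) / (2 * h)
r = np.sqrt(x0)
branch_sum = (norm.pdf(r) + norm.pdf(-r)) / (2 * r)   # sum_k p_Z(z_k) / |f'(z_k)|
print(empirical, branch_sum)
```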

7. Worked example: absolute value

Let

$$x=|z|, \qquad z\in\mathbb R, \qquad x\in[0,\infty).$$

For $x>0$,

$$f^{-1}(x)=\{x,-x\}.$$

The derivative magnitude is 1 on both branches, so

$$p_X(x)=p_Z(x)+p_Z(-x).$$

Now introduce a stochastic inverse:

$$q(z=x\mid x)=q_+(x), \qquad q(z=-x\mid x)=q_-(x),$$

with

$$q_+(x)+q_-(x)=1.$$

Then Jensen’s inequality gives

$$\log p_X(x) \ge \mathbb E_{q(z\mid x)} \left[ \log p_Z(z)-\log q(z\mid x) \right].$$

Expanding the expectation,

$$\mathcal L(x)= q_+(x)\left[\log p_Z(x)-\log q_+(x)\right] + q_-(x)\left[\log p_Z(-x)-\log q_-(x)\right].$$

The bound is tight when $q(z\mid x)$ equals the true posterior over branches:

$$p(z=x\mid x)= \frac{p_Z(x)}{p_Z(x)+p_Z(-x)},$$

and

$$p(z=-x\mid x)= \frac{p_Z(-x)}{p_Z(x)+p_Z(-x)}.$$
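A small numerical illustration of this tightness, assuming an asymmetric base density $p_Z=\mathcal N(\mu,1)$ so that the two branch weights differ: the branch ELBO equals the exact log-likelihood when $q$ matches the true branch posterior, and is strictly smaller otherwise.

```python
# Branch ELBO for x = |z| with p_Z = N(mu, 1) (asymmetric so branch weights differ).
import numpy as np
from scipy.stats import norm

mu, x = 1.0, 0.7
p_plus, p_minus = norm.pdf(x, loc=mu), norm.pdf(-x, loc=mu)
log_px_exact = np.log(p_plus + p_minus)

def branch_elbo(q_plus):
    q_minus = 1.0 - q_plus
    return (q_plus * (np.log(p_plus) - np.log(q_plus))
            + q_minus * (np.log(p_minus) - np.log(q_minus)))

q_star = p_plus / (p_plus + p_minus)        # true posterior over branches
print(log_px_exact, branch_elbo(q_star))    # equal: the bound is tight
print(branch_elbo(0.5))                     # suboptimal q: strictly smaller
```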

8. Worked example: slicing and augmentation

Let

$$z=(x,u),$$

and define a surjection that drops $u$:

$$f(z)=x.$$

The exact likelihood is

$$p_X(x)=\int p_Z(x,u)\,du.$$

This integral may be intractable. Introduce an inverse distribution

$$u\sim q(u\mid x).$$

Then

$$p_X(x) = \int q(u\mid x)\frac{p_Z(x,u)}{q(u\mid x)}\,du.$$

Thus

$$\log p_X(x) \ge \mathbb E_{q(u\mid x)} \left[ \log p_Z(x,u)-\log q(u\mid x) \right].$$

This is the same algebra as the VAE ELBO, but now interpreted as a SurVAE surjection.
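The same algebra can be checked numerically in a toy where the marginal is known: take $z=(x,u)$ jointly Gaussian with unit marginals and correlation $\rho$ (an illustrative assumption). With $q(u\mid x)$ equal to the true conditional the bound is tight; with a mismatched $q$ it is strictly smaller.

```python
# Slicing bound with z = (x, u) jointly Gaussian, corr rho, unit marginals (illustrative).
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(3)
rho, x = 0.8, 0.4
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
log_px_exact = norm.logpdf(x)                       # exact marginal after dropping u

def sliced_elbo(mu_q, s_q, n=200_000):
    u = rng.normal(mu_q, s_q, size=n)
    log_w = (joint.logpdf(np.column_stack([np.full(n, x), u]))
             - norm.logpdf(u, loc=mu_q, scale=s_q))
    return log_w.mean()

print(log_px_exact)
print(sliced_elbo(rho * x, np.sqrt(1 - rho**2)))    # true p(u|x): bound is tight
print(sliced_elbo(0.0, 1.0))                        # mismatched q: strictly smaller
```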

9. Stochastic transformations

A fully stochastic transformation has both an inference density and a generative density:

$$z\sim q_{Z\mid X}(z\mid x), \qquad x\sim p_{X\mid Z}(x\mid z).$$

The marginal likelihood is

$$p_X(x)=\int p_Z(z)p_{X\mid Z}(x\mid z)\,dz.$$

Usually this integral is intractable, giving the lower bound

$$\log p_X(x) \ge \mathbb E_{q(z\mid x)} \left[ \log p_Z(z) + \log p_{X\mid Z}(x\mid z) - \log q_{Z\mid X}(z\mid x) \right].$$

This is the VAE case. SurVAE’s contribution is to treat this as one layer type inside a larger compositional flow.

10. Layerwise likelihood bookkeeping

Consider a composition

$$x=x_0\to x_1\to\cdots\to x_K=z.$$

At the end, evaluate the base density

$$\log p_Z(z).$$

Each layer contributes a correction.

For a bijection,

$$\Delta_k= \log\left|\det J_{T_k}(x_{k-1})\right|.$$

For a stochastic or variational inverse layer,

$$\Delta_k= \log p_k(x_{k-1}\mid x_k)-\log q_k(x_k\mid x_{k-1}),$$

with deterministic delta/Jacobian terms handled analytically when present.

So the total exact likelihood or lower bound has the form

$$\log p_X(x) \gtrsim \log p_Z(z)+\sum_{k=1}^K\Delta_k.$$

The symbol $\gtrsim$ means exact equality for fully exact transformations and a lower bound when stochastic inverses or variational approximations are used.
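The bookkeeping can be written as a tiny compositional sketch in which every layer returns its output together with its local contribution $\Delta_k$. The layers below (a stochastic inverse of the abs surjection with unit branch Jacobians, followed by an elementwise affine bijection) and all parameter values are illustrative assumptions, and the result is a single-sample Monte Carlo estimate of the bound.

```python
# Layerwise bookkeeping sketch: each layer returns (output, Delta_k);
# the bound is log p_Z(z) + sum_k Delta_k (single-sample estimate here).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def abs_inverse_layer(x, q_plus):
    """Stochastic inverse of z -> |z|: sample a sign, Delta = -log q(sign | x)."""
    signs = np.where(rng.random(x.shape) < q_plus, 1.0, -1.0)
    q = np.where(signs > 0, q_plus, 1.0 - q_plus)
    return signs * x, -np.sum(np.log(q))

def affine_layer(x, a, b):
    """Bijection x -> a*x + b: Delta = log|det J| = sum(log|a|)."""
    return a * x + b, np.sum(np.log(np.abs(a)))

def log_likelihood_bound(x_nonneg):
    total = 0.0
    x, d = abs_inverse_layer(x_nonneg, q_plus=0.5); total += d
    x, d = affine_layer(x, a=np.array([0.5, 2.0]), b=np.array([0.0, -0.1])); total += d
    return np.sum(norm.logpdf(x)) + total            # log p_Z(z) + sum_k Delta_k

print(log_likelihood_bound(np.array([0.7, 0.3])))    # data assumed nonnegative
```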

11. Connection back to Gaussianization

Classical Gaussianization says

$$x\mapsto z\sim \mathcal N(0,I)$$

using invertible transformations. SurVAE-style Gaussianization says the map may include operations that are useful but not invertible.

For geoscience, this is natural. Many observation operators are not bijections:

$$\text{high-resolution field}\mapsto \text{coarse-resolution field},$$
$$\text{3D atmospheric state}\mapsto \text{2D column observation},$$
$$\text{radiance spectrum}\mapsto \text{retrieved methane column},$$
$$\text{continuous field}\mapsto \text{quantized satellite product}.$$

These transformations lose information. SurVAE flows provide a density-estimation language for this situation: keep exact likelihoods when possible, introduce stochastic inverses when necessary, and track the resulting lower bound.

12. Summary

| Transformation | Forward behavior | Inverse behavior | Likelihood accounting |
| --- | --- | --- | --- |
| Bijection | one-to-one | deterministic | exact change of variables |
| Surjection | many-to-one | branch sum or stochastic inverse | exact if summed; ELBO if sampled |
| Stochastic | random | stochastic | variational lower bound |

The shortest useful mental model is

$$\boxed{\text{normalizing flows} = \text{Gaussianization by invertible transport}}$$

and

$$\boxed{\text{SurVAE flows} = \text{Gaussianization by transport plus controlled information loss/addition}.}$$

The Dirac delta proof is the bridge: it shows how deterministic transformations can be written as conditional densities, and how their likelihood corrections come from enforcing constraints and correcting volume.

References
  1. Nielsen, D., Jaini, P., Hoogeboom, E., Winther, O., & Welling, M. (2020). SurVAE Flows: Surjections to Bridge the Gap between VAEs and Flows. Advances in Neural Information Processing Systems (NeurIPS), 33, 12685–12696. https://proceedings.neurips.cc/paper/2020/hash/9578a63fbe545bd82cc5bbe749636af1-Abstract.html
  2. Meng, C., Song, Y., Song, J., & Ermon, S. (2020). Gaussianization Flows. arXiv:2003.01941. https://arxiv.org/abs/2003.01941
  3. Laparra, V., Camps-Valls, G., & Malo, J. (2011). Iterative Gaussianization: From ICA to Random Rotations. IEEE Transactions on Neural Networks, 22(4), 537–549. 10.1109/TNN.2011.2106511
  4. Rezende, D. J., & Mohamed, S. (2015). Variational Inference with Normalizing Flows. International Conference on Machine Learning (ICML).
  5. Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density Estimation using Real NVP. International Conference on Learning Representations (ICLR).
  6. Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. International Conference on Learning Representations (ICLR).