
SurVAE flows, Gaussianization, and likelihood accounting

This note gives a deliberately slow proof of the likelihood rules behind SurVAE flows from the perspective of Gaussianization. The goal is to explain why ordinary normalizing flows, surjective transformations, and stochastic VAE-like transformations can all be treated as composable layers with local likelihood contributions.

The main references are SurVAE Flows (Nielsen et al., 2020), Gaussianization Flows (Meng et al., 2020), iterative Gaussianization (Laparra et al., 2011), and the standard normalizing flow / VAE literature (Rezende & Mohamed, 2015; Dinh et al., 2017; Kingma & Welling, 2014).

1. Gaussianization as density estimation

Let

$$x \in \mathcal X, \qquad z \in \mathcal Z, \qquad p_Z(z)=\mathcal N(z;0,I).$$

In Gaussianization, we learn a map

$$T:\mathcal X\to\mathcal Z, \qquad z=T(x),$$

so that the transformed data look approximately standard Gaussian:

$$z=T(x)\sim \mathcal N(0,I).$$

If $T$ is bijective and differentiable, the likelihood follows from the ordinary change-of-variables formula:

$$p_X(x)=p_Z(T(x))\left|\det J_T(x)\right|.$$

Equivalently,

$$\log p_X(x)=\log p_Z(T(x)) + \log\left|\det J_T(x)\right|.$$

For a composition of bijective Gaussianization layers,

$$x=x_0 \mapsto x_1 \mapsto \cdots \mapsto x_K=z,$$

we get

$$\log p_X(x) = \log p_Z(x_K) + \sum_{k=1}^K \log\left|\det J_{T_k}(x_{k-1})\right|.$$

This is the classical normalizing-flow story.
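To make the accounting concrete, here is a minimal sketch of the composed formula, using hypothetical elementwise affine layers and a standard Gaussian base; the layer form, parameter values, and function names are illustrative assumptions, not from the cited papers.

```python
# Sketch of log p_X(x) = log p_Z(x_K) + sum_k log|det J_{T_k}(x_{k-1})|,
# using hypothetical elementwise affine layers z = a * x + b (illustrative only).
import numpy as np
from scipy.stats import norm

def affine_layer(x, a, b):
    """Elementwise bijection x -> a*x + b with log|det J| = sum(log|a|)."""
    return a * x + b, np.sum(np.log(np.abs(a)))

def flow_log_likelihood(x, layers):
    """Push x through the layers and accumulate the log-det corrections."""
    total_log_det = 0.0
    for a, b in layers:
        x, log_det = affine_layer(x, a, b)
        total_log_det += log_det
    return np.sum(norm.logpdf(x)) + total_log_det   # log p_Z(z) + corrections

x = np.array([0.3, -1.2])
layers = [(np.array([2.0, 0.5]), np.array([0.1, -0.3])),
          (np.array([1.5, 1.5]), np.array([0.0, 0.2]))]
print(flow_log_likelihood(x, layers))
```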

SurVAE flows ask: what if some useful transformations are not bijections?

Examples include sorting, absolute value, max pooling, slicing, augmentation, dequantization, periodic wrapping, and VAE-style stochastic maps. These are natural in image modeling, representation learning, and geoscience, where many forward operators lose information.

2. The universal latent-variable identity

Start from the marginal likelihood identity

$$p_X(x)=\int p_{X,Z}(x,z)\,dz.$$

Factor the joint distribution generatively:

$$p_{X,Z}(x,z)=p_Z(z)p_{X\mid Z}(x\mid z).$$

Then

$$p_X(x)=\int p_Z(z)p_{X\mid Z}(x\mid z)\,dz.$$

Now introduce any auxiliary inverse or inference density

$$q_{Z\mid X}(z\mid x),$$

assuming it is positive wherever the integrand is positive. Then

$$p_X(x) = \int q_{Z\mid X}(z\mid x) \frac{p_Z(z)p_{X\mid Z}(x\mid z)}{q_{Z\mid X}(z\mid x)}\,dz.$$

Taking logs gives

$$\log p_X(x) = \log \mathbb E_{q(z\mid x)} \left[ \frac{p_Z(z)p_{X\mid Z}(x\mid z)}{q_{Z\mid X}(z\mid x)} \right].$$

By Jensen’s inequality,

$$\log p_X(x) \ge \mathbb E_{q(z\mid x)} \left[ \log p_Z(z) + \log p_{X\mid Z}(x\mid z) - \log q_{Z\mid X}(z\mid x) \right].$$

A VAE uses this lower bound directly. A bijective normalizing flow is a special case where the bound is exact because the inverse is deterministic and unique. SurVAE flows organize many transformation types under this same accounting system.
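As a sanity check on this algebra, the following sketch evaluates the Jensen bound and an importance-weighted estimate in a one-dimensional toy model where the exact marginal is known in closed form. The model choices ($p_Z=\mathcal N(0,1)$, $p(x\mid z)=\mathcal N(x;z,\sigma^2)$, and a deliberately mismatched Gaussian $q(z\mid x)$) are assumptions for illustration only.

```python
# Toy check: p_Z = N(0,1), p(x|z) = N(x; z, sigma^2), so p_X = N(0, 1 + sigma^2).
# A deliberately mismatched q(z|x) shows the gap between the ELBO and log p_X.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, x = 0.5, 1.3

mu_q, s_q = x, 1.0                      # crude inference density (an assumption)
z = rng.normal(mu_q, s_q, size=200_000)
log_w = (norm.logpdf(z) + norm.logpdf(x, loc=z, scale=sigma)
         - norm.logpdf(z, loc=mu_q, scale=s_q))

elbo = log_w.mean()                                  # Jensen lower bound
log_px_is = np.log(np.mean(np.exp(log_w)))           # importance-sampling estimate
log_px_exact = norm.logpdf(x, scale=np.sqrt(1 + sigma**2))
print(elbo, log_px_is, log_px_exact)                 # elbo <= log_px_exact
```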

3. Bijective transformations

Assume

$$z=T(x), \qquad x=T^{-1}(z),$$

where $T:\mathbb R^D\to\mathbb R^D$ is a differentiable bijection.

3.1 Volume-element proof

For a small region $A\subset \mathcal X$,

$$\mathbb P(x\in A)=\mathbb P(z\in T(A)).$$

Locally,

$$dz=\left|\det J_T(x)\right|dx.$$

Therefore,

$$p_X(x)dx=p_Z(z)dz.$$

Substitute $z=T(x)$:

$$p_X(x)dx=p_Z(T(x))\left|\det J_T(x)\right|dx.$$

Cancel $dx$:

$$p_X(x)=p_Z(T(x))\left|\det J_T(x)\right|.$$

Thus,

$$\log p_X(x)=\log p_Z(T(x))+\log\left|\det J_T(x)\right|.$$
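A quick numerical check of this formula, using the monotone bijection $T(x)=\sinh(x)$ as an illustrative choice: samples of $x=T^{-1}(z)$ with $z\sim\mathcal N(0,1)$ should have an empirical density matching $p_Z(T(x))\,|T'(x)|$.

```python
# Numerical check of p_X(x) = p_Z(T(x)) |T'(x)| for T(x) = sinh(x) (illustrative).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
z = rng.standard_normal(2_000_000)
x = np.arcsinh(z)                                    # x = T^{-1}(z), so x ~ p_X

x0, h = 0.8, 0.01
empirical = np.mean(np.abs(x - x0) < h) / (2 * h)    # histogram-style estimate
formula = norm.pdf(np.sinh(x0)) * np.cosh(x0)        # p_Z(T(x0)) |T'(x0)|
print(empirical, formula)                            # should agree closely
```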

3.2 Dirac-delta proof

Now write the generative direction as

$$z\sim p_Z(z), \qquad x=f(z),$$

where $f=T^{-1}$. Since $x$ is deterministic given $z$, the conditional density is a Dirac delta:

$$p_{X\mid Z}(x\mid z)=\delta(x-f(z)).$$

Hence

$$p_X(x)=\int p_Z(z)\delta(x-f(z))\,dz.$$

Because $f$ is bijective, the equation

$$x=f(z)$$

has exactly one solution

$$z=f^{-1}(x)=T(x).$$

The multivariate delta identity gives

$$\delta(x-f(z)) = \frac{\delta(z-f^{-1}(x))}{\left|\det J_f(f^{-1}(x))\right|}.$$

Therefore,

$$p_X(x) = \int p_Z(z) \frac{\delta(z-f^{-1}(x))}{\left|\det J_f(f^{-1}(x))\right|}\,dz.$$

The denominator is constant with respect to $z$, so

$$p_X(x) = \frac{1}{\left|\det J_f(f^{-1}(x))\right|} \int p_Z(z)\delta(z-f^{-1}(x))\,dz.$$

Using the sifting property of the delta function,

$$\int p_Z(z)\delta(z-f^{-1}(x))\,dz = p_Z(f^{-1}(x)).$$

Thus

$$p_X(x) = \frac{p_Z(f^{-1}(x))}{\left|\det J_f(f^{-1}(x))\right|}.$$

Since $T=f^{-1}$,

$$\left|\det J_T(x)\right| = \frac{1}{\left|\det J_f(f^{-1}(x))\right|},$$

so

$$p_X(x)=p_Z(T(x))\left|\det J_T(x)\right|.$$
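The reciprocal relationship between the two Jacobians can be checked numerically for the same illustrative scalar pair $T(x)=\sinh(x)$, $f(z)=\operatorname{arcsinh}(z)$ used above.

```python
# Check |det J_T(x)| = 1 / |det J_f(f^{-1}(x))| for T(x) = sinh(x), f(z) = arcsinh(z).
import numpy as np

x = 0.8
dT = np.cosh(x)                            # T'(x)
df = 1.0 / np.sqrt(1.0 + np.sinh(x)**2)    # f'(z) evaluated at z = T(x) = sinh(x)
print(dT, 1.0 / df)                        # equal up to floating-point error
```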

4. Bijections as degenerate VAEs

A bijective flow can be written as a latent-variable model with deterministic encoder and decoder:

$$q_{Z\mid X}(z\mid x)=\delta(z-T(x)),$$

and

$$p_{X\mid Z}(x\mid z)=\delta(x-T^{-1}(z)).$$

There is no posterior uncertainty because each $x$ corresponds to exactly one $z$. Therefore, the variational lower bound is tight. This is why normalizing flows give exact likelihoods.

5. Surjective transformations

A map

$$f:\mathcal Z\to\mathcal X$$

is surjective if every $x\in\mathcal X$ has at least one preimage, but possibly many:

$$f^{-1}(x)=\{z:f(z)=x\}.$$

Generatively,

$$z\sim p_Z(z), \qquad x=f(z).$$

The forward map is deterministic, but the inverse is ambiguous.

Examples:

$$x=|z|,$$

where $z=x$ and $z=-x$ both map to the same value;

$$x=\operatorname{sort}(z),$$

where all permutations of $z$ map to the same sorted vector; and

$$x=\operatorname{slice}(z),$$

where some coordinates are discarded.

6. Exact likelihood for finite-to-one surjections

Assume $f:\mathbb R^D\to\mathbb R^D$ is many-to-one but locally invertible on branches. Let the domain decompose into branches

$$\mathcal Z=\bigcup_k \mathcal Z_k,$$

and let

$$f_k:\mathcal Z_k\to\mathcal X$$

be bijective on each branch. For a given $x$, define

$$z_k=f_k^{-1}(x).$$

Start again from the delta representation:

$$p_X(x)=\int p_Z(z)\delta(x-f(z))\,dz.$$

Split the integral over branches:

$$p_X(x)= \sum_k \int_{\mathcal Z_k}p_Z(z)\delta(x-f_k(z))\,dz.$$

On each branch,

$$\delta(x-f_k(z)) = \frac{\delta(z-z_k)}{\left|\det J_{f_k}(z_k)\right|}.$$

Therefore,

$$p_X(x)= \sum_k \frac{p_Z(z_k)}{\left|\det J_{f_k}(z_k)\right|}.$$

Equivalently,

$$p_X(x)= \sum_{z\in f^{-1}(x)} p_Z(z) \left|\det J_{f^{-1}_{\text{branch}}}(x)\right|.$$

This is exact, but the sum may be expensive. Sorting has up to $D!$ branches, for example.
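As a concrete instance of the branch sum, here is a sketch for the two-to-one map $x=z^2$ with $z\sim\mathcal N(0,1)$ (an illustrative choice): the two branches are $z=\pm\sqrt x$, each with forward Jacobian $|f'(z)|=2|z|$.

```python
# Branch-sum check for x = z^2, z ~ N(0,1):
# p_X(x) = [p_Z(sqrt(x)) + p_Z(-sqrt(x))] / (2 sqrt(x)).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x_samples = rng.standard_normal(2_000_000) ** 2

x0, h = 0.5, 0.005
empirical = np.mean(np.abs(x_samples - x0) < h) / (2 * h)
r = np.sqrt(x0)
branch_sum = (norm.pdf(r) + norm.pdf(-r)) / (2 * r)   # sum_k p_Z(z_k) / |f'(z_k)|
print(empirical, branch_sum)
```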

7. Worked example: absolute value

Let

$$x=|z|, \qquad z\in\mathbb R, \qquad x\in[0,\infty).$$

For $x>0$,

$$f^{-1}(x)=\{x,-x\}.$$

The derivative magnitude is 1 on both branches, so

$$p_X(x)=p_Z(x)+p_Z(-x).$$

Now introduce a stochastic inverse:

$$q(z=x\mid x)=q_+(x), \qquad q(z=-x\mid x)=q_-(x),$$

with

$$q_+(x)+q_-(x)=1.$$

Then Jensen’s inequality gives

$$\log p_X(x) \ge \mathbb E_{q(z\mid x)} \left[ \log p_Z(z)-\log q(z\mid x) \right].$$

Expanding the expectation,

$$\mathcal L(x)= q_+(x)\left[\log p_Z(x)-\log q_+(x)\right] + q_-(x)\left[\log p_Z(-x)-\log q_-(x)\right].$$

The bound is tight when $q(z\mid x)$ equals the true posterior over branches:

$$p(z=x\mid x)= \frac{p_Z(x)}{p_Z(x)+p_Z(-x)},$$

and

$$p(z=-x\mid x)= \frac{p_Z(-x)}{p_Z(x)+p_Z(-x)}.$$
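A small numerical illustration of this tightness, assuming an asymmetric base density $p_Z=\mathcal N(\mu,1)$ so that the two branch weights differ: the branch ELBO equals the exact log-likelihood when $q$ matches the true branch posterior, and is strictly smaller otherwise.

```python
# Branch ELBO for x = |z| with p_Z = N(mu, 1) (asymmetric so branch weights differ).
import numpy as np
from scipy.stats import norm

mu, x = 1.0, 0.7
p_plus, p_minus = norm.pdf(x, loc=mu), norm.pdf(-x, loc=mu)
log_px_exact = np.log(p_plus + p_minus)

def branch_elbo(q_plus):
    q_minus = 1.0 - q_plus
    return (q_plus * (np.log(p_plus) - np.log(q_plus))
            + q_minus * (np.log(p_minus) - np.log(q_minus)))

q_star = p_plus / (p_plus + p_minus)        # true posterior over branches
print(log_px_exact, branch_elbo(q_star))    # equal: the bound is tight
print(branch_elbo(0.5))                     # suboptimal q: strictly smaller
```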

8. Worked example: slicing and augmentation

Let

$$z=(x,u),$$

and define a surjection that drops $u$:

$$f(z)=x.$$

The exact likelihood is

$$p_X(x)=\int p_Z(x,u)\,du.$$

This integral may be intractable. Introduce an inverse distribution

$$u\sim q(u\mid x).$$

Then

$$p_X(x) = \int q(u\mid x)\frac{p_Z(x,u)}{q(u\mid x)}\,du.$$

Thus

$$\log p_X(x) \ge \mathbb E_{q(u\mid x)} \left[ \log p_Z(x,u)-\log q(u\mid x) \right].$$

This is the same algebra as the VAE ELBO, but now interpreted as a SurVAE surjection.
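The same algebra can be checked numerically in a toy where the marginal is known: take $z=(x,u)$ jointly Gaussian with unit marginals and correlation $\rho$ (an illustrative assumption). With $q(u\mid x)$ equal to the true conditional the bound is tight; with a mismatched $q$ it is strictly smaller.

```python
# Slicing bound with z = (x, u) jointly Gaussian, corr rho, unit marginals (illustrative).
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(3)
rho, x = 0.8, 0.4
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
log_px_exact = norm.logpdf(x)                       # exact marginal after dropping u

def sliced_elbo(mu_q, s_q, n=200_000):
    u = rng.normal(mu_q, s_q, size=n)
    log_w = (joint.logpdf(np.column_stack([np.full(n, x), u]))
             - norm.logpdf(u, loc=mu_q, scale=s_q))
    return log_w.mean()

print(log_px_exact)
print(sliced_elbo(rho * x, np.sqrt(1 - rho**2)))    # true p(u|x): bound is tight
print(sliced_elbo(0.0, 1.0))                        # mismatched q: strictly smaller
```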

9. Stochastic transformations

A fully stochastic transformation has both an inference density and a generative density:

$$z\sim q_{Z\mid X}(z\mid x), \qquad x\sim p_{X\mid Z}(x\mid z).$$

The marginal likelihood is

$$p_X(x)=\int p_Z(z)p_{X\mid Z}(x\mid z)\,dz.$$

Usually this integral is intractable, giving the lower bound

$$\log p_X(x) \ge \mathbb E_{q(z\mid x)} \left[ \log p_Z(z) + \log p_{X\mid Z}(x\mid z) - \log q_{Z\mid X}(z\mid x) \right].$$

This is the VAE case. SurVAE’s contribution is to treat this as one layer type inside a larger compositional flow.

10. Layerwise likelihood bookkeeping

Consider a composition

$$x=x_0\to x_1\to\cdots\to x_K=z.$$

At the end, evaluate the base density

$$\log p_Z(z).$$

Each layer contributes a correction.

For a bijection,

$$\Delta_k= \log\left|\det J_{T_k}(x_{k-1})\right|.$$

For a stochastic or variational inverse layer,

$$\Delta_k= \log p_k(x_{k-1}\mid x_k)-\log q_k(x_k\mid x_{k-1}),$$

with deterministic delta/Jacobian terms handled analytically when present.

So the total exact likelihood or lower bound has the form

$$\log p_X(x) \gtrsim \log p_Z(z)+\sum_{k=1}^K\Delta_k.$$

The symbol $\gtrsim$ means exact equality for fully exact transformations and a lower bound when stochastic inverses or variational approximations are used.
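The bookkeeping can be written as a tiny compositional sketch in which every layer returns its output together with its local contribution $\Delta_k$. The layers below (a stochastic inverse of the abs surjection with unit branch Jacobians, followed by an elementwise affine bijection) and all parameter values are illustrative assumptions, and the result is a single-sample Monte Carlo estimate of the bound.

```python
# Layerwise bookkeeping sketch: each layer returns (output, Delta_k);
# the bound is log p_Z(z) + sum_k Delta_k (single-sample estimate here).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def abs_inverse_layer(x, q_plus):
    """Stochastic inverse of z -> |z|: sample a sign, Delta = -log q(sign | x)."""
    signs = np.where(rng.random(x.shape) < q_plus, 1.0, -1.0)
    q = np.where(signs > 0, q_plus, 1.0 - q_plus)
    return signs * x, -np.sum(np.log(q))

def affine_layer(x, a, b):
    """Bijection x -> a*x + b: Delta = log|det J| = sum(log|a|)."""
    return a * x + b, np.sum(np.log(np.abs(a)))

def log_likelihood_bound(x_nonneg):
    total = 0.0
    x, d = abs_inverse_layer(x_nonneg, q_plus=0.5); total += d
    x, d = affine_layer(x, a=np.array([0.5, 2.0]), b=np.array([0.0, -0.1])); total += d
    return np.sum(norm.logpdf(x)) + total            # log p_Z(z) + sum_k Delta_k

print(log_likelihood_bound(np.array([0.7, 0.3])))    # data assumed nonnegative
```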

11. Connection back to Gaussianization

Classical Gaussianization says

$$x\mapsto z\sim \mathcal N(0,I)$$

using invertible transformations. SurVAE-style Gaussianization says the map may include operations that are useful but not invertible.

For geoscience, this is natural. Many observation operators are not bijections:

$$\text{high-resolution field}\mapsto \text{coarse-resolution field},$$
$$\text{3D atmospheric state}\mapsto \text{2D column observation},$$
$$\text{radiance spectrum}\mapsto \text{retrieved methane column},$$
$$\text{continuous field}\mapsto \text{quantized satellite product}.$$

These transformations lose information. SurVAE flows provide a density-estimation language for this situation: keep exact likelihoods when possible, introduce stochastic inverses when necessary, and track the resulting lower bound.

12. Summary

| Transformation | Forward behavior | Inverse behavior | Likelihood accounting |
| --- | --- | --- | --- |
| Bijection | one-to-one | deterministic | exact change of variables |
| Surjection | many-to-one | branch sum or stochastic inverse | exact if summed; ELBO if sampled |
| Stochastic | random | stochastic | variational lower bound |

The shortest useful mental model is

$$\boxed{\text{normalizing flows} = \text{Gaussianization by invertible transport}}$$

and

$$\boxed{\text{SurVAE flows} = \text{Gaussianization by transport plus controlled information loss/addition}.}$$

The Dirac delta proof is the bridge: it shows how deterministic transformations can be written as conditional densities, and how their likelihood corrections come from enforcing constraints and correcting volume.

References
  1. Nielsen, D., Jaini, P., Hoogeboom, E., Winther, O., & Welling, M. (2020). SurVAE Flows: Surjections to Bridge the Gap between VAEs and Flows. Advances in Neural Information Processing Systems (NeurIPS), 33, 12685–12696. https://proceedings.neurips.cc/paper/2020/hash/9578a63fbe545bd82cc5bbe749636af1-Abstract.html
  2. Meng, C., Song, Y., Song, J., & Ermon, S. (2020). Gaussianization Flows. arXiv:2003.01941. https://arxiv.org/abs/2003.01941
  3. Laparra, V., Camps-Valls, G., & Malo, J. (2011). Iterative Gaussianization: From ICA to Random Rotations. IEEE Transactions on Neural Networks, 22(4), 537–549. 10.1109/TNN.2011.2106511
  4. Rezende, D. J., & Mohamed, S. (2015). Variational Inference with Normalizing Flows. International Conference on Machine Learning (ICML).
  5. Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density Estimation using Real NVP. International Conference on Learning Representations (ICLR).
  6. Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. International Conference on Learning Representations (ICLR).