SurVAE flows, Gaussianization, and likelihood accounting
This note gives a deliberately slow proof of the likelihood rules behind
SurVAE flows from the perspective of Gaussianization. The goal is to explain
why ordinary normalizing flows, surjective transformations, and stochastic
VAE-like transformations can all be treated as composable layers with local
likelihood contributions.
The main references are SurVAE Flows (Nielsen et al., 2020), Gaussianization Flows (Meng et al., 2020), iterative Gaussianization (Laparra et al., 2011), and the standard normalizing-flow / VAE literature (Rezende & Mohamed, 2015; Dinh et al., 2017; Kingma & Welling, 2014).
A normalizing flow Gaussianizes data by an invertible transport map. A SurVAE flow generalizes this idea: it Gaussianizes using invertible maps, information-losing maps, and stochastic maps, while keeping track of the likelihood, or a variational lower bound on it, layer by layer.
1. Gaussianization as density estimation

Let

$$x \in \mathcal X, \qquad z \in \mathcal Z, \qquad p_Z(z)=\mathcal N(z;0,I).$$

In Gaussianization, we learn a map

$$T:\mathcal X\to\mathcal Z, \qquad z=T(x),$$

so that the transformed data look approximately standard Gaussian:

$$z=T(x)\sim \mathcal N(0,I).$$

If $T$ is bijective and differentiable, the likelihood follows from the ordinary change-of-variables formula:

$$p_X(x)=p_Z(T(x))\left|\det J_T(x)\right|.$$

Equivalently,

$$\log p_X(x)=\log p_Z(T(x)) + \log\left|\det J_T(x)\right|.$$

There are two common directions. The analysis or Gaussianization direction maps data to latent variables, $z=T(x)$. The generative direction maps latent variables to data, $x=T^{-1}(z)$. The log-determinant changes sign depending on which direction is used.

For a composition of bijective Gaussianization layers,

$$x=x_0 \mapsto x_1 \mapsto \cdots \mapsto x_K=z,$$

we get

$$\log p_X(x) = \log p_Z(x_K) + \sum_{k=1}^K \log\left|\det J_{T_k}(x_{k-1})\right|.$$

This is the classical normalizing-flow story.
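To make the layerwise bookkeeping concrete, here is a minimal numpy sketch (all function names are my own, not from any particular flow library) that Gaussianizes with two element-wise bijections and accumulates the log-determinant terms:

```python
import numpy as np

def standard_normal_logpdf(z):
    # log N(z; 0, I), summed over the last (feature) axis
    return np.sum(-0.5 * z**2 - 0.5 * np.log(2.0 * np.pi), axis=-1)

def arctanh_layer(x):
    # z = arctanh(x) on (-1, 1); log|det J| = sum_i -log(1 - x_i^2)
    return np.arctanh(x), np.sum(-np.log1p(-x**2), axis=-1)

def affine_layer(x, scale, shift):
    # z = scale * x + shift; log|det J| = sum_i log|scale_i|
    return scale * x + shift, np.sum(np.log(np.abs(scale)))

def log_px(x):
    logdet = 0.0
    z, d = arctanh_layer(x); logdet = logdet + d
    z, d = affine_layer(z, scale=np.array([2.0, 0.5]), shift=0.0); logdet = logdet + d
    return standard_normal_logpdf(z) + logdet

x = np.array([[0.3, -0.1], [0.0, 0.5]])
print(log_px(x))   # log p_X(x) = log p_Z(x_K) + sum_k log|det J_{T_k}|
```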
SurVAE flows ask: what if some useful transformations are not bijections?
Examples include sorting, absolute value, max pooling, slicing, augmentation,
dequantization, periodic wrapping, and VAE-style stochastic maps. These are
natural in image modeling, representation learning, and geoscience, where many
forward operators lose information.
2. The universal latent-variable identity

Start from the marginal likelihood identity

$$p_X(x)=\int p_{X,Z}(x,z)\,dz.$$

Factor the joint distribution generatively:

$$p_{X,Z}(x,z)=p_Z(z)\,p_{X\mid Z}(x\mid z).$$

Then

$$p_X(x)=\int p_Z(z)\,p_{X\mid Z}(x\mid z)\,dz.$$

Now introduce any auxiliary inverse or inference density $q_{Z\mid X}(z\mid x)$, assuming it is positive wherever the integrand is positive. Then

$$p_X(x) = \int q_{Z\mid X}(z\mid x)\, \frac{p_Z(z)\,p_{X\mid Z}(x\mid z)}{q_{Z\mid X}(z\mid x)}\,dz.$$

Taking logs gives

$$\log p_X(x) = \log \mathbb E_{q(z\mid x)} \left[ \frac{p_Z(z)\,p_{X\mid Z}(x\mid z)}{q_{Z\mid X}(z\mid x)} \right].$$

By Jensen's inequality,

$$\log p_X(x) \ge \mathbb E_{q(z\mid x)} \left[ \log p_Z(z) + \log p_{X\mid Z}(x\mid z) - \log q_{Z\mid X}(z\mid x) \right].$$

SurVAE flows are easiest to understand from this identity. Every layer asks: what is the forward generative density $p(x\mid z)$, what is the inverse density $q(z\mid x)$, and which terms are exact versus variational?
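A quick numerical sanity check may help. In the toy model below (my own choice: $z\sim\mathcal N(0,1)$, $x\mid z\sim\mathcal N(z,\sigma^2)$, whose exact marginal is $\mathcal N(0,1+\sigma^2)$), the pre-Jensen identity reproduces $\log p_X(x)$ via log-mean-exp, while the Jensen bound sits strictly below it for an imperfect $q$:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
sigma, x = 0.5, 1.3                      # toy model and observation (assumptions)

exact = norm.logpdf(x, scale=np.sqrt(1 + sigma**2))   # marginal is N(0, 1 + sigma^2)

# A deliberately imperfect inference density q(z|x) = N(0.5 x, 0.4^2)
mu_q, s_q = 0.5 * x, 0.4
z = rng.normal(mu_q, s_q, size=200_000)
log_w = (norm.logpdf(z)                          # log p_Z(z)
         + norm.logpdf(x, loc=z, scale=sigma)    # + log p_{X|Z}(x|z)
         - norm.logpdf(z, loc=mu_q, scale=s_q))  # - log q(z|x)

identity = logsumexp(log_w) - np.log(len(z))     # log E_q[p/q]: exact up to MC noise
elbo = log_w.mean()                              # E_q[log p/q]: a strict lower bound
print(exact, identity, elbo)
```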
A VAE uses this lower bound directly. A bijective normalizing flow is a special
case where the bound is exact because the inverse is deterministic and unique.
SurVAE flows organize many transformation types under this same accounting
system.
3. Two proofs of the change-of-variables formula

Assume

$$z=T(x), \qquad x=T^{-1}(z),$$

where $T:\mathbb R^D\to\mathbb R^D$ is a differentiable bijection.
3.1 Volume-element proof

For a small region $A\subset \mathcal X$,

$$\mathbb P(x\in A)=\mathbb P(z\in T(A)).$$

Locally,

$$dz=\left|\det J_T(x)\right|dx.$$

Therefore,

$$p_X(x)\,dx=p_Z(z)\,dz.$$

Substitute $z=T(x)$:

$$p_X(x)\,dx=p_Z(T(x))\left|\det J_T(x)\right|dx.$$

Cancel $dx$:

$$p_X(x)=p_Z(T(x))\left|\det J_T(x)\right|.$$

Thus,

$$\log p_X(x)=\log p_Z(T(x))+\log\left|\det J_T(x)\right|.$$

The Jacobian determinant is a local volume correction. If $T$ expands a small volume around $x$, then the latent density must be pulled back with a larger factor. If $T$ contracts volume, the correction is smaller.
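As a concrete check (my example, using scipy), $T(x)=\log x$ Gaussianizes standard lognormal data, and the change-of-variables density matches scipy's reference lognormal density:

```python
import numpy as np
from scipy.stats import norm, lognorm

# T(x) = log(x) Gaussianizes a standard lognormal. Change of variables:
# p_X(x) = p_Z(log x) * |d(log x)/dx| = N(log x; 0, 1) / x
x = np.array([0.5, 1.0, 2.0, 3.0])
p_change_of_vars = norm.pdf(np.log(x)) / x
p_reference = lognorm(s=1.0).pdf(x)   # scipy's standard lognormal density
print(np.allclose(p_change_of_vars, p_reference))  # True
```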
3.2 Dirac-delta proof

Now write the generative direction as

$$z\sim p_Z(z), \qquad x=f(z),$$

where $f=T^{-1}$. Since $x$ is deterministic given $z$, the conditional density is a Dirac delta:

$$p_{X\mid Z}(x\mid z)=\delta(x-f(z)).$$

Hence

$$p_X(x)=\int p_Z(z)\,\delta(x-f(z))\,dz.$$

Because $f$ is bijective, the equation $x=f(z)$ has exactly one solution,

$$z=f^{-1}(x)=T(x).$$

The multivariate delta identity gives

$$\delta(x-f(z)) = \frac{\delta(z-f^{-1}(x))}{\left|\det J_f(f^{-1}(x))\right|}.$$

Therefore,

$$p_X(x) = \int p_Z(z)\, \frac{\delta(z-f^{-1}(x))}{\left|\det J_f(f^{-1}(x))\right|} \,dz.$$

The denominator is constant with respect to $z$, so

$$p_X(x) = \frac{1}{\left|\det J_f(f^{-1}(x))\right|} \int p_Z(z)\,\delta(z-f^{-1}(x))\,dz.$$

Using the sifting property of the delta function,

$$\int p_Z(z)\,\delta(z-f^{-1}(x))\,dz = p_Z(f^{-1}(x)).$$

Thus

$$p_X(x) = \frac{p_Z(f^{-1}(x))}{\left|\det J_f(f^{-1}(x))\right|}.$$

Since $T=f^{-1}$, the inverse function theorem gives

$$\left|\det J_T(x)\right| = \frac{1}{\left|\det J_f(f^{-1}(x))\right|},$$

so

$$p_X(x)=p_Z(T(x))\left|\det J_T(x)\right|.$$
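The delta representation also suggests a numerical experiment that previews the VAE connection below: replace $\delta(x-f(z))$ with a narrow Gaussian decoder $\mathcal N(x; f(z), \epsilon^2)$ and Monte Carlo the integral; as $\epsilon\to 0$ this recovers the change-of-variables density. A sketch with $f=\sinh$ (my choice, so that $T=\operatorname{arcsinh}$ has a clean Jacobian):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Bijection f(z) = sinh(z); T = f^{-1} = arcsinh, |det J_T(x)| = 1/sqrt(1 + x^2)
x = 1.0
exact = norm.pdf(np.arcsinh(x)) / np.sqrt(1 + x**2)

# Replace delta(x - f(z)) with a narrow Gaussian N(x; f(z), eps^2) and
# Monte Carlo the integral p_X(x) = E_{z ~ p_Z}[ delta(x - f(z)) ].
eps = 0.05
z = rng.standard_normal(1_000_000)
smoothed = norm.pdf(x, loc=np.sinh(z), scale=eps).mean()
print(exact, smoothed)   # agree up to MC noise and an O(eps^2) smoothing bias
```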
4. Bijections as degenerate VAEs

A bijective flow can be written as a latent-variable model with a deterministic encoder and decoder:

$$q_{Z\mid X}(z\mid x)=\delta(z-T(x)),$$

and

$$p_{X\mid Z}(x\mid z)=\delta(x-T^{-1}(z)).$$

There is no posterior uncertainty because each $x$ corresponds to exactly one $z$. Therefore, the variational lower bound is tight. This is why normalizing flows give exact likelihoods.
5. Surjective transformations

A map

$$f:\mathcal Z\to\mathcal X$$

is surjective if every $x\in\mathcal X$ has at least one preimage, but possibly many:

$$f^{-1}(x)=\{z:f(z)=x\}.$$

Generatively,

$$z\sim p_Z(z), \qquad x=f(z).$$

The forward map is deterministic, but the inverse is ambiguous. Examples:

- $x=|z|$, where $z=x$ and $z=-x$ both map to the same value;
- $x=\operatorname{sort}(z)$, where all permutations of $z$ map to the same sorted vector; and
- $x=\operatorname{slice}(z)$, where some coordinates are discarded.
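A two-line check of the many-to-one behavior:

```python
import numpy as np

print(np.array_equal(np.sort([3.0, 1.0, 2.0]), np.sort([2.0, 3.0, 1.0])))  # True
print(abs(1.7) == abs(-1.7))   # True: |z| has two preimages for every x > 0
```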
6. Exact likelihood for finite-to-one surjections

Assume $f:\mathbb R^D\to\mathbb R^D$ is many-to-one but locally invertible on branches. Let the domain decompose into branches

$$\mathcal Z=\bigcup_k \mathcal Z_k,$$

and let

$$f_k:\mathcal Z_k\to\mathcal X$$

be bijective on each branch. For a given $x$, define

$$z_k=f_k^{-1}(x).$$

Start again from the delta representation:

$$p_X(x)=\int p_Z(z)\,\delta(x-f(z))\,dz.$$

Split the integral over branches:

$$p_X(x)= \sum_k \int_{\mathcal Z_k}p_Z(z)\,\delta(x-f_k(z))\,dz.$$

On each branch,

$$\delta(x-f_k(z)) = \frac{\delta(z-z_k)}{\left|\det J_{f_k}(z_k)\right|}.$$

Therefore,

$$p_X(x)= \sum_k \frac{p_Z(z_k)}{\left|\det J_{f_k}(z_k)\right|}.$$

Equivalently,

$$p_X(x)= \sum_{z\in f^{-1}(x)} p_Z(z) \left|\det J_{f^{-1}_{\text{branch}}}(x)\right|.$$

This is exact, but the sum may be expensive. Sorting has up to $D!$ branches, for example.
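For a concrete branch sum, take $f(z)=z^2$ with $z\sim\mathcal N(0,1)$ (my example, not from the text): the two branches $z=\pm\sqrt x$ each have $|f'(z_k)|=2\sqrt x$, and the branch sum recovers the $\chi^2_1$ density:

```python
import numpy as np
from scipy.stats import norm, chi2

# f(z) = z^2 has two branches, z = +sqrt(x) and z = -sqrt(x), each with
# |f'(z_k)| = 2 sqrt(x). The branch sum gives the chi-squared(1) density.
x = np.array([0.25, 1.0, 2.5])
branch_sum = (norm.pdf(np.sqrt(x)) + norm.pdf(-np.sqrt(x))) / (2 * np.sqrt(x))
print(np.allclose(branch_sum, chi2(df=1).pdf(x)))  # True
```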
7. Worked example: absolute value

Let

$$x=|z|, \qquad z\in\mathbb R, \qquad x\in[0,\infty).$$

For $x>0$,

$$f^{-1}(x)=\{x,-x\}.$$

The derivative magnitude is 1 on both branches, so

$$p_X(x)=p_Z(x)+p_Z(-x).$$

Now introduce a stochastic inverse:

$$q(z=x\mid x)=q_+(x), \qquad q(z=-x\mid x)=q_-(x),$$

with

$$q_+(x)+q_-(x)=1.$$

Then Jensen's inequality gives

$$\log p_X(x) \ge \mathbb E_{q(z\mid x)} \left[ \log p_Z(z)-\log q(z\mid x) \right].$$

Expanding the expectation,

$$\mathcal L(x)= q_+(x)\left[\log p_Z(x)-\log q_+(x)\right] + q_-(x)\left[\log p_Z(-x)-\log q_-(x)\right].$$

The bound is tight when $q(z\mid x)$ equals the true posterior over branches:

$$p(z=x\mid x)= \frac{p_Z(x)}{p_Z(x)+p_Z(-x)},$$

and

$$p(z=-x\mid x)= \frac{p_Z(-x)}{p_Z(x)+p_Z(-x)}.$$
8. Worked example: slicing and augmentation

Let

$$z=(x,u),$$

and define a surjection that drops $u$:

$$f(x,u)=x.$$

The exact likelihood is

$$p_X(x)=\int p_Z(x,u)\,du.$$

This integral may be intractable. Introduce an inverse distribution

$$u\sim q(u\mid x).$$

Then

$$p_X(x) = \int q(u\mid x)\frac{p_Z(x,u)}{q(u\mid x)}\,du.$$

Thus

$$\log p_X(x) \ge \mathbb E_{q(u\mid x)} \left[ \log p_Z(x,u)-\log q(u\mid x) \right].$$

This is the same algebra as the VAE ELBO, but now interpreted as a SurVAE surjection.
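A sketch of this bound in a case where everything is tractable (my setup: $z=(x,u)$ bivariate Gaussian with correlation $\rho$, so the exact marginal of $x$ is standard normal and the true conditional is $u\mid x\sim\mathcal N(\rho x,\,1-\rho^2)$):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
rho = 0.8                                         # correlation (assumption)
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

x = 0.9
exact = norm.logpdf(x)   # marginal of a standard bivariate Gaussian is N(0, 1)

def elbo_slice(m, s, n=200_000):
    # Monte Carlo estimate of E_{q(u|x)}[log p_Z(x, u) - log q(u|x)]
    u = rng.normal(m, s, size=n)
    xu = np.column_stack([np.full(n, x), u])
    return (joint.logpdf(xu) - norm.logpdf(u, loc=m, scale=s)).mean()

# Tight when q(u|x) is the true conditional N(rho * x, 1 - rho^2):
print(exact, elbo_slice(rho * x, np.sqrt(1 - rho**2)))
# A mismatched q(u|x) gives a strictly lower bound:
print(exact, elbo_slice(0.0, 1.0))
```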
9. Stochastic transformations

A fully stochastic transformation has both an inference density and a generative density:

$$z\sim q_{Z\mid X}(z\mid x), \qquad x\sim p_{X\mid Z}(x\mid z).$$

The marginal likelihood is

$$p_X(x)=\int p_Z(z)\,p_{X\mid Z}(x\mid z)\,dz.$$

Usually this integral is intractable, giving the lower bound

$$\log p_X(x) \ge \mathbb E_{q(z\mid x)} \left[ \log p_Z(z) + \log p_{X\mid Z}(x\mid z) - \log q_{Z\mid X}(z\mid x) \right].$$

This is the VAE case. SurVAE's contribution is to treat this as one layer type inside a larger compositional flow.
10. Layerwise likelihood bookkeeping

Consider a composition

$$x=x_0\to x_1\to\cdots\to x_K=z.$$

At the end, evaluate the base density

$$\log p_Z(z).$$

Each layer contributes a correction. For a bijection,

$$\Delta_k= \log\left|\det J_{T_k}(x_{k-1})\right|.$$

For a stochastic or variational inverse layer,

$$\Delta_k= \log p_k(x_{k-1}\mid x_k)-\log q_k(x_k\mid x_{k-1}),$$

with deterministic delta/Jacobian terms handled analytically when present.

So the total exact likelihood or lower bound has the form

$$\log p_X(x) \gtrsim \log p_Z(z)+\sum_{k=1}^K\Delta_k.$$

The symbol $\gtrsim$ means exact equality for fully exact transformations and a lower bound when stochastic inverses or variational approximations are used.
A SurVAE layer needs two things: a forward sample/evaluate rule and a local log contribution. Bijections contribute log-determinants. Surjections contribute branch, inverse, or entropy corrections. Stochastic layers contribute $\log p-\log q$ terms.
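A minimal sketch of this interface (the protocol and class names are my own, not the SurVAE reference implementation): each layer returns its output together with its local $\Delta_k$, and a driver sums the contributions along the analysis direction:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Layer protocol (illustrative): forward(x) -> (z, delta), where delta is the
# layer's local contribution Delta_k to the exact likelihood or the bound.

class AffineBijection:
    def __init__(self, scale, shift):
        self.scale, self.shift = scale, shift
    def forward(self, x):
        # analysis direction z = (x - shift)/scale; Delta = log|det J| = -log|scale|
        return (x - self.shift) / self.scale, -np.log(np.abs(self.scale))

class AbsSurjection:
    def forward(self, x):
        # generative x = |z|; the stochastic inverse picks a sign uniformly, so
        # Delta = log p(x|z) - log q(z|x) = 0 - log(1/2) = log 2
        # (the deterministic delta/Jacobian term is handled analytically: it is 0)
        sign = rng.choice([-1.0, 1.0])
        return sign * x, np.log(2.0)

def log_px_bound(x, layers):
    total = 0.0
    for layer in layers:             # analysis (Gaussianization) direction
        x, delta = layer.forward(x)
        total += delta
    return total + norm.logpdf(x)    # base density log p_Z(z)

layers = [AbsSurjection(), AffineBijection(scale=2.0, shift=0.5)]
est = np.mean([log_px_bound(0.7, layers) for _ in range(50_000)])
# Exact: generatively x = |2 z + 0.5| with z ~ N(0,1), so at x = 0.7
exact = np.log(0.5 * (norm.pdf((0.7 - 0.5) / 2) + norm.pdf((-0.7 - 0.5) / 2)))
print(exact, est)   # est <= exact: a stochastic lower bound with a small Jensen gap
```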
11. Connection back to Gaussianization

Classical Gaussianization maps

$$x\mapsto z\sim \mathcal N(0,I)$$

using invertible transformations. SurVAE-style Gaussianization says the map may include operations that are useful but not invertible. Examples:

- periodic wrapping canonicalizes angles but loses winding number;
- sorting canonicalizes permutation symmetry but loses the original order;
- pooling summarizes local patches but loses sub-patch detail;
- slicing projects from a higher-dimensional latent representation to observed coordinates;
- dequantization maps discrete observations into continuous latent variables.

For geoscience, this is natural. Many observation operators are not bijections:

$$\text{high-resolution field}\mapsto \text{coarse-resolution field},$$
$$\text{3D atmospheric state}\mapsto \text{2D column observation},$$
$$\text{radiance spectrum}\mapsto \text{retrieved methane column},$$
$$\text{continuous field}\mapsto \text{quantized satellite product}.$$

These transformations lose information. SurVAE flows provide a density-estimation language for this situation: keep exact likelihoods when possible, introduce stochastic inverses when necessary, and track the resulting lower bound.
12. Summary

| Transformation | Forward behavior | Inverse behavior | Likelihood accounting |
| --- | --- | --- | --- |
| Bijection | one-to-one | deterministic | exact change of variables |
| Surjection | many-to-one | branch sum or stochastic inverse | exact if summed; ELBO if sampled |
| Stochastic | random | stochastic | variational lower bound |

The shortest useful mental model is

$$\boxed{\text{normalizing flows} = \text{Gaussianization by invertible transport}}$$

and

$$\boxed{\text{SurVAE flows} = \text{Gaussianization by transport plus controlled information loss/addition}.}$$

The Dirac-delta proof is the bridge: it shows how deterministic transformations can be written as conditional densities, and how their likelihood corrections come from enforcing constraints and correcting volume.
References

Nielsen, D., Jaini, P., Hoogeboom, E., Winther, O., & Welling, M. (2020). SurVAE Flows: Surjections to Bridge the Gap between VAEs and Flows. Advances in Neural Information Processing Systems (NeurIPS), 33, 12685–12696. https://proceedings.neurips.cc/paper/2020/hash/9578a63fbe545bd82cc5bbe749636af1-Abstract.html

Meng, C., Song, Y., Song, J., & Ermon, S. (2020). Gaussianization Flows. arXiv:2003.01941. https://arxiv.org/abs/2003.01941

Laparra, V., Camps-Valls, G., & Malo, J. (2011). Iterative Gaussianization: From ICA to Random Rotations. IEEE Transactions on Neural Networks, 22(4), 537–549. doi:10.1109/TNN.2011.2106511

Rezende, D. J., & Mohamed, S. (2015). Variational Inference with Normalizing Flows. International Conference on Machine Learning (ICML).

Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density Estimation using Real NVP. International Conference on Learning Representations (ICLR).

Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. International Conference on Learning Representations (ICLR).