Zero-kernel coupling ≡ diagonal Gaussianization
A coupling flow with mixture-CDF blocks normally captures cross-dim dependence via the conditioner. But when the conditioner is zero-initialised — specifically, when the final Dense of the conditioner has kernel $W = 0$ — something interesting happens: the flow collapses into a plain diagonal Gaussianization flow.
This notebook:
- Recaps the four objects at play (the mixture-CDF primitive, the diagonal Gaussianization flow, iterative vs parametric training, the coupling flow).
- Works through the math showing the equivalence (Lemmas 1–3 → Theorem → Corollary).
- Verifies it empirically: build matched diagonal and coupling flows, IG-initialise both, compare pushforwards element-wise.
- Shows that a single gradient step is enough for the coupling flow to break the equivalence — its conditioner starts modulating on $x_a$ and the flow becomes strictly more expressive.
The upshot: IG init on a coupling flow gives you the best of both worlds — a well-understood RBIG starting point and a conditioner waiting to do cross-dim work once training begins.
1. Mathematical recap
Self-contained summary of the pieces so this notebook stands on its own.
Notation and shapes
- $D$ — data dimensionality, $K$ — mixture components per marginal, $N$ — batch size, $L$ — number of blocks in the flow.
- Random vector $X$ on $\mathbb{R}^D$ (what we have data from). Latent $Z \sim \mathcal{N}(0, I_D)$ (what we want the flow to land on).
- Bijective map $f : \mathbb{R}^D \to \mathbb{R}^D$ with Jacobian $J_f(x) = \partial f / \partial x$.
Change of variables (the one identity behind everything)
For a bijection $f$ and any base density $p_Z$,

$$\log p_X(x) = \log p_Z\big(f(x)\big) + \log\big|\det J_f(x)\big|.$$
Training minimises the negative of this averaged over data — the NLL.
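A tiny numeric check of the identity, using an affine bijection whose pushforward density is known in closed form (standalone NumPy/SciPy sketch; not part of the library):

import numpy as np
from scipy.stats import norm

# change of variables for z = a*x + c with standard-normal base:
# log p_X(x) = log phi(a*x + c) + log|a|
a, c = 2.0, -0.5
x = np.random.default_rng(0).normal(size=5)
log_px = norm.logpdf(a * x + c) + np.log(abs(a))

# cross-check: X = (Z - c)/a is exactly N(-c/a, 1/a^2)
assert np.allclose(log_px, norm.logpdf(x, loc=-c / a, scale=1 / abs(a)))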
1-D mixture-CDF Gaussianization (the primitive)
Classical probability integral transform: if $X$ has strictly increasing CDF $F$, then $F(X) \sim \mathrm{Uniform}(0,1)$, so $\Phi^{-1}(F(X)) \sim \mathcal{N}(0,1)$. We approximate $F$ with a mixture of normals,

$$F_\theta(x) = \sum_{k=1}^{K} \pi_k\, \Phi\!\left(\frac{x - \mu_k}{\sigma_k}\right).$$

Parameter shapes: $\pi \in \Delta^{K-1}$ (weights on the simplex), $\mu \in \mathbb{R}^K$, $\sigma \in \mathbb{R}_{>0}^K$. The forward map and its log-derivative are

$$z = \Phi^{-1}\big(F_\theta(x)\big), \qquad \log\frac{dz}{dx} = \log f_\theta(x) - \log\varphi(z),$$

where $f_\theta = F_\theta'$ is the mixture density and $\varphi$ is the standard normal pdf. Inverse uses monotone bisection on $F_\theta(x) = \Phi(z)$.
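For concreteness, a NumPy/SciPy sketch of the primitive. This is illustrative only; the library's MixtureCDFGaussianization parameterises and batches this differently:

import numpy as np
from scipy.stats import norm

def mixture_cdf_forward(x, pi, mu, sigma):
    # z = Phi^{-1}(F_theta(x)) and log dz/dx, elementwise over a batch x: (N,)
    u = np.sum(pi * norm.cdf((x[:, None] - mu) / sigma), axis=-1)   # F_theta(x)
    z = norm.ppf(np.clip(u, 1e-7, 1 - 1e-7))                        # clip keeps ppf finite
    f = np.sum(pi / sigma * norm.pdf((x[:, None] - mu) / sigma), axis=-1)
    return z, np.log(f) - norm.logpdf(z)

def mixture_cdf_inverse(z, pi, mu, sigma, n_iter=60):
    # monotone bisection on F_theta(x) = Phi(z)
    target = norm.cdf(z)
    lo = np.full_like(z, mu.min() - 10 * sigma.max())
    hi = np.full_like(z, mu.max() + 10 * sigma.max())
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        u = np.sum(pi * norm.cdf((mid[:, None] - mu) / sigma), axis=-1)
        lo, hi = np.where(u < target, mid, lo), np.where(u < target, hi, mid)
    return 0.5 * (lo + hi)

# round-trip check on a toy mixture
pi = np.full(8, 1 / 8); mu = np.linspace(-2, 2, 8); sigma = np.full(8, 0.5)
x = np.random.default_rng(0).normal(size=100)
z, _ = mixture_cdf_forward(x, pi, mu, sigma)
assert np.allclose(mixture_cdf_inverse(z, pi, mu, sigma), x, atol=1e-6)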
Diagonal marginal Gaussianization layer (MixtureCDFGaussianization)
Lift the 1-D primitive to $D$ dims by applying it independently per dim. Parameters are tensors of shape $(D, K)$: per-dim weights $\pi_d$, means $\mu_d$, scales $\sigma_d$.
Forward: $z_d = \Phi^{-1}\big(F_{\theta_d}(x_d)\big)$ for each $d$. Jacobian is diagonal, so

$$\log|\det J| = \sum_{d=1}^{D} \big( \log f_{\theta_d}(x_d) - \log\varphi(z_d) \big).$$
Rotation layer
Orthogonal $R \in O(D)$ — either a product of num_reflectors Householder reflections $H_i = I - 2\, v_i v_i^\top / \|v_i\|^2$, or a fixed (frozen) matrix. $|\det R| = 1$, so rotations contribute zero to the flow’s log-det. Their role is to redistribute non-Gaussianity across dims between marginal blocks — a marginal layer can only Gaussianize dim-by-dim, so the rotation “mixes” structure that’s otherwise stuck in single dims.
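A sketch of the Householder construction; the library's rotation layer may parameterise it differently:

import numpy as np

def householder_product(V):
    # V: (m, D) reflection vectors -> orthogonal R = H_m @ ... @ H_1
    R = np.eye(V.shape[1])
    for v in V:
        v = v / np.linalg.norm(v)
        R = R - 2.0 * np.outer(v, v @ R)   # apply H = I - 2 v v^T without forming it
    return R

R = householder_product(np.random.default_rng(0).normal(size=(2, 2)))
assert np.allclose(R @ R.T, np.eye(2), atol=1e-12)   # orthogonal
assert np.isclose(abs(np.linalg.det(R)), 1.0)        # |det R| = 1 -> zero log-det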
A full Gaussianization flow: $L$ blocks of (rotation, marginal)

$$f = (\Psi_L \circ R_L) \circ \cdots \circ (\Psi_1 \circ R_1),$$

with total log-det $\sum_{\ell=1}^{L} \log|\det J_{\Psi_\ell}|$ (rotations drop out).
Iterative Gaussianization — RBIG (Laparra & Malo 2011)
Greedy, non-gradient fit. Walk through the flow one block at a time. Starting with $h^{(0)} = x$:
- Fit PCA on $h^{(\ell-1)}$, rotate: $u = R_\ell\, h^{(\ell-1)}$.
- Fit per-dim GMM via EM on each column of $u$ → $(\pi, \mu, \sigma)$ of shape $(D, K)$.
- Apply the marginal transform: $h^{(\ell)}_d = \Phi^{-1}\big(F_{\theta_d}(u_d)\big)$ for each dim $d$.
- Increment $\ell$ and repeat.
Laparra & Malo show the negentropy decreases at a geometric rate under mild conditions. Remarkably, this already gives a decent density estimator — before any gradient descent.
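A minimal RBIG-style loop, sketched with scikit-learn's PCA and GaussianMixture. This is not initialize_flow_from_ig; it just mirrors the recipe above on toy data:

import numpy as np
from scipy.stats import norm
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def rbig_block(h, num_components=8):
    # one (rotation, marginal) block: PCA rotate, then per-dim mixture-CDF
    R = PCA().fit(h).components_                 # (D, D), rows orthonormal
    u = h @ R.T
    out = np.empty_like(u)
    for d in range(u.shape[1]):
        gmm = GaussianMixture(num_components).fit(u[:, d:d + 1])
        pi, mu = gmm.weights_, gmm.means_.ravel()
        sigma = np.sqrt(gmm.covariances_.ravel())
        F = np.sum(pi * norm.cdf((u[:, d:d + 1] - mu) / sigma), axis=1)
        out[:, d] = norm.ppf(np.clip(F, 1e-7, 1 - 1e-7))
    return out

h = np.random.default_rng(0).normal(size=(1000, 2)) ** 3   # toy non-Gaussian data
for _ in range(4):
    h = rbig_block(h)        # h drifts toward N(0, I) block by block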
Parametric Gaussianization
Same architecture as above. All parameters are learned jointly by gradient descent on the NLL. Gradient flows through the whole composition, so later blocks can compensate for finite-$K$ approximation errors in earlier blocks — something RBIG fundamentally cannot do.
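For the 1-D primitive the flow NLL reduces to the mixture NLL, since $\log p(x) = \log\varphi(z) + \log\frac{dz}{dx} = \log f_\theta(x)$. A standalone TensorFlow sketch of such a gradient fit (illustrative; not the library's training loop):

import math
import numpy as np
import tensorflow as tf

x = tf.constant(np.random.default_rng(0).standard_normal(2000).astype("float32") ** 3)
K = 8
logits = tf.Variable(tf.zeros(K))                 # pi = softmax(logits)
mu = tf.Variable(tf.linspace(-2.0, 2.0, K))
log_sigma = tf.Variable(tf.zeros(K))
opt = tf.keras.optimizers.Adam(1e-2)

for _ in range(500):
    with tf.GradientTape() as tape:
        log_pi = tf.nn.log_softmax(logits)
        comp = (-0.5 * ((x[:, None] - mu) / tf.exp(log_sigma)) ** 2
                - log_sigma - 0.5 * math.log(2.0 * math.pi))
        nll = -tf.reduce_mean(tf.reduce_logsumexp(log_pi + comp, axis=-1))
    grads = tape.gradient(nll, [logits, mu, log_sigma])
    opt.apply_gradients(zip(grads, [logits, mu, log_sigma]))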
Coupling flow (the MixtureCDFCoupling layer)
Swap the diagonal marginal for a conditional transform. Pick a mask $m \in \{0,1\}^D$. It splits $x$ into an “identity” half $x_a$ (indices $A$ where $m_d = 1$, size $d_a$) and a “transformed” half $x_b$ (indices $B$ where $m_d = 0$, size $d_b$).
A conditioner maps $x_a$ to the mixture parameters that will Gaussianize $x_b$. Decompose it through its final Dense as

$$\theta(x_a) = W\, h(x_a) + b,$$

where $h(x_a) \in \mathbb{R}^H$ is the output of the MLP body, $W \in \mathbb{R}^{H \times 3 d_b K}$, $b \in \mathbb{R}^{3 d_b K}$. The output is reshaped to $(d_b, 3K)$ and split into $\big(\pi(x_a), \mu(x_a), \sigma(x_a)\big)$ — one mixture per $B$-dim, depending on $x_a$.
The forward map:

$$z_a = x_a, \qquad z_d = \Phi^{-1}\big(F_{\theta_d(x_a)}(x_d)\big) \text{ for } d \in B.$$

Jacobian is block-triangular (top half is the identity on $x_a$, bottom-left carries $\partial z_b / \partial x_a$, bottom-right is diagonal). Only the bottom-right diagonal enters the determinant:

$$\log|\det J| = \sum_{d \in B} \big( \log f_{\theta_d(x_a)}(x_d) - \log\varphi(z_d) \big).$$

Same form as the diagonal marginal’s log-det — only difference is that the mixture params are functions of $x_a$ instead of fixed.
make_coupling_flow stacks $L$ blocks of (rotation, couple($m$), couple($\lnot m$)): the mask swap ensures that both halves get transformed within each block.
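A shape-level sketch of one coupling forward pass; the parameter packing inside the library's MixtureCDFCoupling may differ:

import numpy as np
from scipy.stats import norm

def coupling_forward(x, mask, conditioner):
    # x: (N, D); conditioner maps x_a -> theta of shape (N, d_b, 3K)
    xa, xb = x[:, mask == 1], x[:, mask == 0]
    pi, mu, log_sigma = np.split(conditioner(xa), 3, axis=-1)
    pi = np.exp(pi) / np.exp(pi).sum(-1, keepdims=True)     # softmax -> simplex
    F = np.sum(pi * norm.cdf((xb[..., None] - mu) / np.exp(log_sigma)), axis=-1)
    z = x.copy()
    z[:, mask == 0] = norm.ppf(np.clip(F, 1e-7, 1 - 1e-7))  # transform the B half
    return z                                                # identity half untouched

# zero-kernel case: a constant conditioner (same theta for every x_a)
theta0 = np.random.default_rng(0).normal(size=(1, 1, 24))
x = np.random.default_rng(1).normal(size=(100, 2))
z = coupling_forward(x, np.array([1, 0]), lambda xa: np.repeat(theta0, len(xa), axis=0))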
The equivalence we’re going to prove
A diagonal Gaussianization flow is strictly less expressive than a coupling flow — the coupling has the freedom to modulate its mixture on $x_a$ (through $\theta(x_a)$); the diagonal cannot. But when the conditioner’s final Dense has $W = 0$, the function collapses: $\theta(x_a) = b$ is the same for every $x_a$, and the coupling layer reduces exactly to a diagonal marginal with params unpacked from $b$. This is the regime IG init puts the coupling flow into.
2. The setup — coupling layer, decomposed
With the recap in hand, zoom in on a single coupling layer. Input $x \in \mathbb{R}^{N \times D}$ (batch $N$, dim $D$). Mask $m$ picks the “identity” dims $A$ and “transformed” dims $B$. Write $x_a$ and $x_b$ for the two halves.
The conditioner final-Dense decomposition, repeated here for convenience:

$$\theta(x_a) = W\, h(x_a) + b.$$

IG init sets $W = 0$ and $b$ to the RBIG fit.
Lemma 1 — the conditioner becomes a constant
With $W = 0$,

$$\theta(x_a) = W\, h(x_a) + b = b \qquad \text{for every } x_a.$$

Every example sees the same mixture parameters. The forward pass on $x_b$ is therefore

$$z_b = \Phi^{-1}\big(F_{\bar\theta}(x_b)\big), \qquad \bar\theta = \mathrm{unpack}(b),$$

a function of $x_b$ alone.
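Lemma 1 is easy to check on a toy stand-in conditioner (hypothetical MLP; the real conditioner lives inside MixtureCDFCoupling):

import numpy as np
import keras

cond = keras.Sequential([
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(3 * 1 * 8),                  # 3K outputs for d_b = 1, K = 8
])
_ = cond(np.zeros((1, 1), dtype="float32"))         # build the weights
W, b = cond.layers[-1].get_weights()
cond.layers[-1].set_weights([np.zeros_like(W), b])  # zero-kernel init

out = cond.predict(np.random.randn(64, 1).astype("float32"), verbose=0)
assert np.allclose(out, out[0])                     # theta(x_a) = b for every x_a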
Lemma 2 — the log-det collapses to a diagonal
The coupling Jacobian is block-triangular:

$$J = \begin{pmatrix} I_{d_a} & 0 \\ \partial z_b / \partial x_a & \mathrm{diag}\big(\partial z_b / \partial x_b\big) \end{pmatrix}.$$

The off-diagonal block carries $\partial\theta/\partial x_a$ through the chain rule, but with $W = 0$ we get $\partial\theta/\partial x_a = 0$, so $\partial z_b / \partial x_a = 0$ too. The matrix is now block-diagonal, and

$$\log|\det J| = \sum_{d \in B} \big( \log f_{\bar\theta_d}(x_d) - \log\varphi(z_d) \big).$$

Which is exactly what a diagonal MixtureCDFGaussianization layer would produce on the same $B$-indices with the same mixture params.
Lemma 3 — two complementary-mask layers pair into one full-$D$ marginal (pedantic)
A single coupling layer only transforms the $d_b$-dim half. make_coupling_flow uses pairs [couple(m), couple(~m)] inside each block. Claim: with zero-kernel init, one such pair equals one full-$D$ diagonal marginal layer. Proof by index chase.
Let the first layer use mask $m$ with index sets $A$ (identity) and $B$ (transformed), $|A| = d_a$, $|B| = d_b$. Its zero-kernel parameters pack into a bias $b^{(1)}$ encoding per-$B$-dim params $\theta^{(1)}_d$ for $d \in B$. Forward:

$$y_d = \begin{cases} x_d, & d \in A \\ \Phi^{-1}\big(F_{\theta^{(1)}_d}(x_d)\big), & d \in B. \end{cases}$$

The second layer uses $\lnot m$. Its index sets are swapped: identity on $B$, transform on $A$. With its zero-kernel bias $b^{(2)}$ encoding params $\theta^{(2)}_d$ for $d \in A$, forward:

$$z_d = \begin{cases} \Phi^{-1}\big(F_{\theta^{(2)}_d}(y_d)\big), & d \in A \\ y_d, & d \in B. \end{cases}$$

Now substitute $z$ in terms of $x$. For $d \in B$ we have $z_d = y_d = \Phi^{-1}\big(F_{\theta^{(1)}_d}(x_d)\big)$ (unchanged by layer 2). For $d \in A$ we have $y_d = x_d$ (untouched by layer 1), so $z_d = \Phi^{-1}\big(F_{\theta^{(2)}_d}(x_d)\big)$. Combining:

$$z_d = \Phi^{-1}\big(F_{\theta_d}(x_d)\big) \text{ for all } d, \qquad \theta_d = \begin{cases} \theta^{(2)}_d, & d \in A \\ \theta^{(1)}_d, & d \in B. \end{cases}$$

Each dimension is Gaussianized independently by its own mixture, with no cross-dim interaction — the definition of a full-$D$ diagonal marginal layer. Take its parameters to be the $\theta_d$ above, and the coupling pair equals MixtureCDFGaussianization with those parameters. Log-dets add the same way:

$$\log|\det J_{\text{pair}}| = \sum_{d \in B}\big(\log f_{\theta^{(1)}_d}(x_d) - \log\varphi(z_d)\big) + \sum_{d \in A}\big(\log f_{\theta^{(2)}_d}(x_d) - \log\varphi(z_d)\big) = \sum_{d=1}^{D}\big(\log f_{\theta_d}(x_d) - \log\varphi(z_d)\big).$$
Theorem — full-flow equivalence at zero-kernel init
A coupling flow

$$f_{\text{cpl}} = \big(\text{couple}(\lnot m) \circ \text{couple}(m) \circ R_L\big) \circ \cdots \circ \big(\text{couple}(\lnot m) \circ \text{couple}(m) \circ R_1\big)$$

with zero-kernel init on every coupling layer reduces, by Lemma 3 applied block-by-block, to

$$f_{\text{diag}} = (\Psi_L \circ R_L) \circ \cdots \circ (\Psi_1 \circ R_1),$$
provided the rotations and marginal parameters are matched. initialize_flow_from_ig walks both kinds of flow with the same per-block PCA and per-dim GMM fits (seeded by (block_idx, dim_idx)), so the matching holds automatically.
Corollary — both flows equal RBIG
Classical RBIG produces exactly the sequence of transforms that initialize_flow_from_ig assigns, so

$$f_{\text{cpl}} = f_{\text{diag}} = f_{\text{RBIG}} \qquad \text{at init.}$$
Time for a numerical check.
import os
os.environ["KERAS_BACKEND"] = "tensorflow"
import keras
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from keras import ops
from sklearn.datasets import make_moons
from gaussianization.gauss_keras import (
    base_nll_loss,
    initialize_flow_from_ig,
    make_coupling_flow,
    make_gaussianization_flow,
)
# --- global plot styling ------------------------------------------------------
sns.set_theme(context="poster", style="whitegrid", palette="deep", font_scale=0.85)
plt.rcParams.update(
    {
        "figure.dpi": 110,
        "savefig.dpi": 110,
        "savefig.bbox": "tight",
        "savefig.pad_inches": 0.2,
        "figure.constrained_layout.use": True,
        "figure.constrained_layout.h_pad": 0.1,
        "figure.constrained_layout.w_pad": 0.1,
        "axes.grid.which": "both",
        "grid.linewidth": 0.7,
        "grid.alpha": 0.5,
        "axes.edgecolor": "0.25",
        "axes.linewidth": 1.1,
        "axes.titleweight": "semibold",
        "axes.labelpad": 6,
    }
)
def style_axes(ax, *, aspect=None, grid=True):
    if grid:
        ax.minorticks_on()
        ax.grid(True, which="major", linewidth=0.8, alpha=0.6)
        ax.grid(True, which="minor", linewidth=0.4, alpha=0.3)
    if aspect is not None:
        ax.set_aspect(aspect)
    return ax
def style_jointgrid(g, aspect="equal"):
    style_axes(g.ax_joint, aspect=aspect)
    style_axes(g.ax_marg_x, grid=False)
    style_axes(g.ax_marg_y, grid=False)
    for spine in ("top", "right"):
        g.ax_marg_x.spines[spine].set_visible(False)
        g.ax_marg_y.spines[spine].set_visible(False)
palette = sns.color_palette("deep")
COLOR_CPL = palette[0]
COLOR_DIAG = palette[2]
COLOR_DELTA = palette[3]
keras.utils.set_random_seed(0)
np.random.seed(0)
3. Matched flows on two moons
Same dataset as elsewhere. Build a diagonal flow and a coupling flow, both with num_blocks=4 and num_components=8, then IG-init both on the training data. Both flows see identical PCA and GMM fits by construction of initialize_flow_from_ig.
X_raw, _ = make_moons(n_samples=5000, noise=0.05, random_state=0)
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X = X.astype("float32")
NUM_BLOCKS = 4
NUM_COMPONENTS = 8
keras.utils.set_random_seed(11)
flow_diag = make_gaussianization_flow(
    input_dim=2,
    num_blocks=NUM_BLOCKS,
    num_reflectors=2,
    num_components=NUM_COMPONENTS,
)
_ = flow_diag(ops.convert_to_tensor(X[:4]))
initialize_flow_from_ig(flow_diag, X)
keras.utils.set_random_seed(11)
flow_cpl = make_coupling_flow(
    input_dim=2,
    num_blocks=NUM_BLOCKS,
    num_components=NUM_COMPONENTS,
    hidden=(32, 32),
)
_ = flow_cpl(ops.convert_to_tensor(X[:4]))
initialize_flow_from_ig(flow_cpl, X)
print(f"diagonal flow bijectors : {len(flow_diag.bijector_layers)}")
print(f"coupling flow bijectors : {len(flow_cpl.bijector_layers)}")
print(f"diagonal flow trainable params : "
f"{int(sum(np.prod(w.shape) for w in flow_diag.trainable_weights))}")
print(f"coupling flow trainable params : "
f"{int(sum(np.prod(w.shape) for w in flow_cpl.trainable_weights))}")E0000 00:00:1776870386.599572 442045 cuda_platform.cc:52] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
diagonal flow bijectors : 8
coupling flow bijectors : 12
diagonal flow trainable params : 208
coupling flow trainable params : 15312
4. Element-wise equality check
Theorem says $f_{\text{cpl}}(x) = f_{\text{diag}}(x)$ for every $x$. Evaluate both on the whole training set and compute:
- per-sample error $\|z_{\text{cpl}} - z_{\text{diag}}\|_2$
- per-sample log-prob difference $|\log p_{\text{cpl}}(x) - \log p_{\text{diag}}(x)|$
Expected: float32 precision on the bulk ($\sim 10^{-7}$ to $10^{-6}$). A handful of tail points drift up to $\sim 10^{-2}$ from accumulated float error through the multi-block composition.
z_diag = ops.convert_to_numpy(flow_diag(ops.convert_to_tensor(X)))
z_cpl = ops.convert_to_numpy(flow_cpl(ops.convert_to_tensor(X)))
err = np.linalg.norm(z_cpl - z_diag, axis=-1)
lp_diag = ops.convert_to_numpy(flow_diag.log_prob(ops.convert_to_tensor(X)))
lp_cpl = ops.convert_to_numpy(flow_cpl.log_prob(ops.convert_to_tensor(X)))
lp_err = np.abs(lp_cpl - lp_diag)
print("pushforward difference ||z_cpl - z_diag||_2")
print(f" median : {np.median(err):.2e}")
print(f" 95th% : {np.percentile(err, 95):.2e}")
print(f" max : {err.max():.2e}")
print()
print("log-prob difference |log p_cpl(x) - log p_diag(x)|")
print(f" median : {np.median(lp_err):.2e}")
print(f" 95th% : {np.percentile(lp_err, 95):.2e}")
print(f" max : {lp_err.max():.2e}")pushforward difference ||z_cpl - z_diag||_2
median : 3.63e-07
95th% : 2.48e-06
max : 1.26e-02
log-prob difference |log p_cpl(x) - log p_diag(x)|
median : 7.15e-07
95th% : 5.72e-06
max : 1.31e-03
Visualise the pushforwards overlaid
If the two flows are equivalent, plotting both pushforwards in the same axes should show a single indistinguishable point cloud. The right panel is the scatter of the pointwise error vectors $\Delta z = z_{\text{cpl}} - z_{\text{diag}}$, which should sit on the origin to within float precision.
fig, axes = plt.subplots(1, 2, figsize=(14, 7))
ax = axes[0]
ax.scatter(z_diag[:, 0], z_diag[:, 1], s=12, alpha=0.55,
           color=COLOR_DIAG, label="diagonal flow")
ax.scatter(z_cpl[:, 0], z_cpl[:, 1], s=5, alpha=0.35,
           color=COLOR_CPL, label="coupling flow")
ax.set_xlabel("$z_1$")
ax.set_ylabel("$z_2$")
ax.set_title("Pushforwards overlaid")
ax.set_xlim(-4, 4)
ax.set_ylim(-4, 4)
ax.legend(loc="upper left", frameon=True)
style_axes(ax, aspect="equal")
ax = axes[1]
delta = z_cpl - z_diag
lim = max(1e-5, np.max(np.abs(delta)) * 1.2)
ax.scatter(delta[:, 0], delta[:, 1], s=10, alpha=0.6, color=COLOR_DELTA)
ax.axhline(0, color="black", linewidth=0.8, alpha=0.6)
ax.axvline(0, color="black", linewidth=0.8, alpha=0.6)
ax.set_xlim(-lim, lim)
ax.set_ylim(-lim, lim)
ax.set_xlabel("$\\Delta z_1$")
ax.set_ylabel("$\\Delta z_2$")
ax.set_title("$\\Delta z = z_{\\mathrm{cpl}} - z_{\\mathrm{diag}}$")
style_axes(ax, aspect="equal")
plt.show()
5. Training breaks the equivalence
One gradient step of the coupling flow sets its final-Dense kernels to nonzero values. The conditioner stops being constant-in-$x_a$, and the layer becomes a genuinely conditional transform. From that point on, $f_{\text{cpl}}$ diverges from $f_{\text{diag}}$.
Train only the coupling flow (the diagonal flow stays frozen as the reference). Re-check the pointwise error after a handful of epochs.
flow_cpl.compile(optimizer=keras.optimizers.Adam(3e-3), loss=base_nll_loss)
hist = flow_cpl.fit(X, X, batch_size=512, epochs=20, verbose=0)
z_cpl_trained = ops.convert_to_numpy(flow_cpl(ops.convert_to_tensor(X)))
err_trained = np.linalg.norm(z_cpl_trained - z_diag, axis=-1)
lp_cpl_trained = ops.convert_to_numpy(flow_cpl.log_prob(ops.convert_to_tensor(X)))
lp_err_trained = np.abs(lp_cpl_trained - lp_diag)
print("after 20 epochs of coupling-flow training:")
print()
print("pushforward difference ||z_cpl - z_diag||_2")
print(f" median : {np.median(err_trained):.2e}")
print(f" 95th% : {np.percentile(err_trained, 95):.2e}")
print(f" max : {err_trained.max():.2e}")
print()
print("log-prob difference |log p_cpl(x) - log p_diag(x)|")
print(f" median : {np.median(lp_err_trained):.2e}")
print(f" 95th% : {np.percentile(lp_err_trained, 95):.2e}")
print(f" max : {lp_err_trained.max():.2e}")
print()
print(f"mean log-likelihood diagonal (frozen) = {lp_diag.mean():+.3f} nats/sample")
print(f"mean log-likelihood coupling (trained) = {lp_cpl_trained.mean():+.3f} nats/sample")after 20 epochs of coupling-flow training:
pushforward difference ||z_cpl - z_diag||_2
median : 1.99e-01
95th% : 6.11e-01
max : 2.41e+00
log-prob difference |log p_cpl(x) - log p_diag(x)|
median : 8.34e-01
95th% : 2.11e+00
max : 7.66e+00
mean log-likelihood diagonal (frozen) = -1.827 nats/sample
mean log-likelihood coupling (trained) = -1.122 nats/sample
Visualise the divergence
Same two panels, after training. Now the overlay shows two visibly different point clouds, and $\Delta z$ fills a non-trivial region of the plane — the conditioner has learnt cross-dim structure the diagonal flow cannot express.
fig, axes = plt.subplots(1, 2, figsize=(14, 7))
ax = axes[0]
ax.scatter(z_diag[:, 0], z_diag[:, 1], s=12, alpha=0.55,
           color=COLOR_DIAG, label="diagonal flow (frozen)")
ax.scatter(z_cpl_trained[:, 0], z_cpl_trained[:, 1], s=5, alpha=0.35,
           color=COLOR_CPL, label="coupling flow (trained)")
ax.set_xlabel("$z_1$")
ax.set_ylabel("$z_2$")
ax.set_title("Pushforwards overlaid (after training)")
ax.set_xlim(-4, 4)
ax.set_ylim(-4, 4)
ax.legend(loc="upper left", frameon=True)
style_axes(ax, aspect="equal")
ax = axes[1]
delta_trained = z_cpl_trained - z_diag
lim = max(0.05, np.max(np.abs(delta_trained)) * 1.1)
ax.scatter(delta_trained[:, 0], delta_trained[:, 1], s=10, alpha=0.6, color=COLOR_DELTA)
ax.axhline(0, color="black", linewidth=0.8, alpha=0.6)
ax.axvline(0, color="black", linewidth=0.8, alpha=0.6)
ax.set_xlim(-lim, lim)
ax.set_ylim(-lim, lim)
ax.set_xlabel("$\\Delta z_1$")
ax.set_ylabel("$\\Delta z_2$")
ax.set_title("$\\Delta z$ after training — equivalence is broken")
style_axes(ax, aspect="equal")
plt.show()
Error magnitudes side-by-side
One bar chart comparing the pointwise error and log-prob difference, before vs after training. Log scale on the y-axis because the two regimes differ by 6+ orders of magnitude.
stats = {
    "at init (W=0)": (err, lp_err),
    "after 20 epochs": (err_trained, lp_err_trained),
}
labels = list(stats.keys())
err_medians = [np.median(stats[k][0]) for k in labels]
err_maxs = [stats[k][0].max() for k in labels]
lp_medians = [np.median(stats[k][1]) for k in labels]
lp_maxs = [stats[k][1].max() for k in labels]
fig, axes = plt.subplots(1, 2, figsize=(14, 5.5))
width = 0.35
x = np.arange(len(labels))
ax = axes[0]
ax.bar(x - width/2, err_medians, width, label="median", color=palette[0])
ax.bar(x + width/2, err_maxs, width, label="max", color=palette[1])
ax.set_yscale("log")
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_ylabel("$\\|z_\\mathrm{cpl} - z_\\mathrm{diag}\\|_2$")
ax.set_title("Pushforward error")
ax.legend(loc="upper left", frameon=True)
style_axes(ax)
ax = axes[1]
ax.bar(x - width/2, lp_medians, width, label="median", color=palette[0])
ax.bar(x + width/2, lp_maxs, width, label="max", color=palette[1])
ax.set_yscale("log")
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_ylabel("$|\\log p_\\mathrm{cpl}(x) - \\log p_\\mathrm{diag}(x)|$")
ax.set_title("Log-prob error")
ax.legend(loc="upper left", frameon=True)
style_axes(ax)
plt.show()
Takeaways
- At zero-kernel init the coupling flow is a reparameterisation of a diagonal Gaussianization flow, which itself is the parametric form of RBIG. Three different architectures, one function — empirically equal to float precision.
- The only freedom unused at init is the conditioner’s final kernel $W$. Training lets that kernel grow, which is exactly what turns a diagonal Gaussianization into a genuinely conditional coupling flow.
- Practical implication: when using make_coupling_flow + initialize_flow_from_ig, you can think of the first forward pass as “here is RBIG” and of training as “now refine RBIG with cross-dim modulation”. The warm-start doesn’t constrain the eventual model — only the starting point.