Sequential Probabilistic Inference Schema
This is the sequential (state-space) companion to the
Full Probabilistic Inference Schema. We now have a sequence of
time steps $t = 1, \ldots, T$. The state $\boldsymbol{u}_t$ evolves according to
a dynamical model (the transition distribution), and observations
$\boldsymbol{y}_t$ are generated from the current state $\boldsymbol{u}_t$ via the
observation operator $\mathbf{H}$.

The same two model tracks carry over: a simulator (state evolves directly)
and an emulator (a latent state $\boldsymbol{z}_t$ evolves in a compressed
space). So do the same three inference regimes (exact, variational, amortized),
now joined by the classical recursive filtering and smoothing
algorithms.
Notation
Table 1: Symbols used throughout this note.

| Symbol | Space | Meaning |
| --- | --- | --- |
| $\boldsymbol{y}_t$ | $\mathbb{R}^{D_y}$ | observation at time $t$ (gappy, noisy) |
| $\boldsymbol{u}_t$ | $\mathbb{R}^{D_u}$ | full state at time $t$ |
| $\boldsymbol{z}_t$ | $\mathbb{R}^{D_z}$ | emulator latent state at time $t$ (emulator track) |
| $\boldsymbol{\theta}$ | $\mathbb{R}^{D_\theta}$ | parameters (static; do not evolve in time) |
| $\boldsymbol{x}_t$ | $\mathbb{R}^{D_x}$ | covariates / controls at time $t$ |
| $\boldsymbol{\psi}$ | $\mathbb{R}^{D_\psi}$ | inference (variational) parameters |
| $\boldsymbol{u}_{1:T}$ | - | the full trajectory $(\boldsymbol{u}_1, \ldots, \boldsymbol{u}_T)$ |

$D_\bullet$ denotes the dimensionality of an object (as in the static note).
We write $\boldsymbol{a}_{1:t} \equiv (\boldsymbol{a}_1, \ldots, \boldsymbol{a}_t)$
for a sub-trajectory. New sequential counts: $T$ is the number of time steps,
$N_e$ the ensemble size, $N_p$ the number of particles. Capital $N, M$
remain reserved for sample counts; here $N$ counts the sequences in the dataset that
amortized inference generalises over.
Sequential Generative Model
The joint distribution over the full sequence factorises as

$$
p(\boldsymbol{y}_{1:T}, \boldsymbol{u}_{1:T}, \boldsymbol{\theta} \mid \boldsymbol{x}_{1:T}) =
\underbrace{p(\boldsymbol{\theta} \mid \boldsymbol{x}_{1:T})}_{\text{param prior}} \,
\underbrace{p(\boldsymbol{u}_0 \mid \boldsymbol{\theta}, \boldsymbol{x}_0)}_{\text{initial prior}}
\prod_{t=1}^{T}
\underbrace{p(\boldsymbol{y}_t \mid \boldsymbol{u}_t, \boldsymbol{\theta}, \boldsymbol{x}_t)}_{\text{observation}} \,
\underbrace{p(\boldsymbol{u}_t \mid \boldsymbol{u}_{t-1}, \boldsymbol{\theta}, \boldsymbol{x}_t)}_{\text{transition}}
\tag{1}
$$

Components
Initial state prior, $p(\boldsymbol{u}_0 \mid \boldsymbol{\theta}, \boldsymbol{x}_0)$: distribution over the state before any observations.
Transition model (dynamics), $p(\boldsymbol{u}_t \mid \boldsymbol{u}_{t-1}, \boldsymbol{\theta}, \boldsymbol{x}_t)$: how the state evolves from $t-1$ to $t$. May be a physical simulator, a learned emulator, or both. $\boldsymbol{x}_t$ carries the forcing/controls at time $t$; $\boldsymbol{\theta}$ governs the dynamics (diffusion, advection, …).
Observation model, $p(\boldsymbol{y}_t \mid \boldsymbol{u}_t, \boldsymbol{\theta}, \boldsymbol{x}_t)$: how the state generates observations, with the same operator $\mathbf{H}$ applied at each $t$. Observations are gappy and noisy at every step.
Parameter prior, $p(\boldsymbol{\theta} \mid \boldsymbol{x}_{1:T})$: static parameters shared across all time steps.
Graphical Structure
$\boldsymbol{u}_t$ depends on $\boldsymbol{u}_{t-1}, \boldsymbol{\theta}, \boldsymbol{x}_t$ (a Markov transition).
$\boldsymbol{y}_t$ depends on $\boldsymbol{u}_t, \boldsymbol{\theta}, \boldsymbol{x}_t$ only, so it is conditionally independent of every other state and observation given $\boldsymbol{u}_t$.
$\boldsymbol{\theta}$ is shared across all time steps (it sits outside the plate over $t$); $\boldsymbol{x}_t$ is observed at each step.
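The dependency structure above corresponds to ancestral sampling: draw the initial state, then alternate transition and observation draws. The following is a minimal sketch for a linear-Gaussian instance, with $\boldsymbol{\theta}$ held fixed. The function name `simulate`, the matrices `A`, `B`, `Q`, `R`, the gap probability `obs_prob`, and the use of NaN to mark missing entries are all illustrative assumptions, not part of the schema.

```python
import numpy as np

def simulate(A, B, H, Q, R, x, u0_mean, u0_cov, obs_prob=0.7, seed=0):
    """Draw one trajectory u_{0:T} and gappy, noisy observations y_{1:T}."""
    rng = np.random.default_rng(seed)
    T = x.shape[0]
    Du, Dy = A.shape[0], H.shape[0]
    u = np.empty((T + 1, Du))
    y = np.full((T, Dy), np.nan)                      # NaN marks a missing entry
    u[0] = rng.multivariate_normal(u0_mean, u0_cov)   # initial state prior
    for t in range(1, T + 1):
        mean = A @ u[t - 1] + B @ x[t - 1]            # x[t - 1] stores x_t
        u[t] = rng.multivariate_normal(mean, Q)       # Markov transition
        full = rng.multivariate_normal(H @ u[t], R)   # noisy observation of u_t
        seen = rng.random(Dy) < obs_prob              # random gaps (assumed mechanism)
        y[t - 1, seen] = full[seen]
    return u, y

# Hypothetical two-dimensional state, one observed component, constant forcing.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = 0.1 * np.eye(1)
x = np.ones((50, 1))
u, y = simulate(A, B, H, Q, R, x, np.zeros(2), np.eye(2))
```

Each $\boldsymbol{y}_t$ is drawn from $\boldsymbol{u}_t$ alone, which is the conditional independence the filtering recursions rely on.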
Target Posteriors
Filtering posterior (online, causal): state at $t$ given observations up to and including $t$; uses no future observations and is updated recursively as each $\boldsymbol{y}_t$ arrives:

$$
p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:t}, \boldsymbol{x}_{1:t}, \boldsymbol{\theta})
$$

Smoothing posterior (offline, non-causal): state at $t$ given all observations, including future ones; requires the full sequence $\boldsymbol{y}_{1:T}$:

$$
p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T}, \boldsymbol{\theta})
$$

Prediction posterior: state $k$ steps ahead given observations up to $t$ (none from $t+1$ onward):

$$
p(\boldsymbol{u}_{t+k} \mid \boldsymbol{y}_{1:t}, \boldsymbol{x}_{1:t+k}, \boldsymbol{\theta})
$$

Parameter posterior: static parameters inferred from the full sequence, marginalising the state trajectory:

$$
p(\boldsymbol{\theta} \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T}) =
\int p(\boldsymbol{\theta} \mid \boldsymbol{y}_{1:T}, \boldsymbol{u}_{1:T}, \boldsymbol{x}_{1:T}) \,
p(\boldsymbol{u}_{1:T} \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T}) \, \mathrm{d}\boldsymbol{u}_{1:T}
$$

Joint smoothing posterior: the full joint over all states and parameters:

$$
p(\boldsymbol{u}_{1:T}, \boldsymbol{\theta} \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T})
$$

For $t < T$ the smoothing posterior conditions on strictly more information than the
filtering posterior, so on average it is never worse:
$\text{Filtering} \;\preceq\; \text{Smoothing}$. The price is that smoothing is
offline: it needs the whole sequence before it can report the state at any
interior time $t$.
Track 1: Simulator
The simulator generative model is (1) above: the state
$\boldsymbol{u}_t$ evolves directly, with no internal latent variable.
1A · Exact Posteriors
1B · Filtering Algorithms
Recursive algorithms that process observations one at a time and maintain a
running approximation to the filtering posterior.
Linear-Gaussian (exact): the Kalman filter. Requires

$$
\begin{aligned}
p(\boldsymbol{u}_t \mid \boldsymbol{u}_{t-1}, \boldsymbol{\theta}, \boldsymbol{x}_t)
&= \mathcal{N}\!\left(\boldsymbol{u}_t \mid \mathbf{A}_{\boldsymbol{\theta}} \boldsymbol{u}_{t-1} + \mathbf{B}_{\boldsymbol{\theta}} \boldsymbol{x}_t, \; \mathbf{Q}_{\boldsymbol{\theta}}\right) \\
p(\boldsymbol{y}_t \mid \boldsymbol{u}_t, \boldsymbol{\theta}, \boldsymbol{x}_t)
&= \mathcal{N}\!\left(\boldsymbol{y}_t \mid \mathbf{H} \boldsymbol{u}_t, \; \mathbf{R}_{\boldsymbol{\theta}}\right)
\end{aligned}
$$

Predict (push the previous filter through the dynamics):

$$
p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:t-1}, \boldsymbol{x}_{1:t}, \boldsymbol{\theta}) =
\int p(\boldsymbol{u}_t \mid \boldsymbol{u}_{t-1}, \boldsymbol{\theta}, \boldsymbol{x}_t) \,
p(\boldsymbol{u}_{t-1} \mid \boldsymbol{y}_{1:t-1}, \boldsymbol{x}_{1:t-1}, \boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{u}_{t-1}
\tag{8}
$$

Update (correct the prediction with the new observation):

$$
p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:t}, \boldsymbol{x}_{1:t}, \boldsymbol{\theta}) \propto
p(\boldsymbol{y}_t \mid \boldsymbol{u}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \,
p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:t-1}, \boldsymbol{x}_{1:t}, \boldsymbol{\theta})
\tag{9}
$$

Both steps are exact and closed-form for linear-Gaussian models.
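In the linear-Gaussian case the predict and update steps reduce to the standard Kalman recursions on a mean and covariance. The following is a minimal sketch, assuming dense NumPy arrays and a fully observed $\boldsymbol{y}_t$ (no gaps); the function name `kalman_step` and the argument names `m`, `P` for the filtering mean and covariance are illustrative choices.

```python
import numpy as np

def kalman_step(m, P, y, x, A, B, H, Q, R):
    """One predict-update cycle; m, P are the previous filtering mean and covariance."""
    # Predict: push the previous filter through the linear dynamics.
    m_pred = A @ m + B @ x
    P_pred = A @ P @ A.T + Q
    # Update: condition the prediction on the new observation y.
    S = H @ P_pred @ H.T + R                   # innovation covariance
    K = np.linalg.solve(S, H @ P_pred).T       # Kalman gain P_pred H^T S^{-1}
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = (np.eye(len(m)) - K @ H) @ P_pred
    return m_new, P_new

# Scalar random walk: A = H = 1, B = 0, Q = R = 1, prior N(0, 1), observation y = 2.
one = np.eye(1)
m_new, P_new = kalman_step(np.zeros(1), one, np.array([2.0]), np.zeros(1),
                           one, np.zeros((1, 1)), one, one, one)
```

In this toy case the predicted variance is 2 and the gain is 2/3, so the filtering mean is 4/3 and the filtering variance is 2/3.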
Nonlinear (approximate): the extended Kalman filter. Linearise the transition and observation models
around the current state estimate (first-order Taylor expansion, using Jacobians), then run the
same predict–update structure (8)–(9). The result is approximate
because of the linearisation, and it degrades when the dynamics are strongly nonlinear.
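The linearised recursion can be sketched as follows. This is an illustration under stated assumptions: additive Gaussian noise with covariances `Q` and `R`, user-supplied callables `f`, `h` for the transition and observation means and `F_jac`, `H_jac` for their Jacobians (all hypothetical names), and covariates folded into `f`.

```python
import numpy as np

def ekf_step(m, P, y, f, F_jac, h, H_jac, Q, R):
    """One extended-Kalman predict-update cycle around the current mean."""
    F = F_jac(m)                               # transition Jacobian at the previous mean
    m_pred = f(m)
    P_pred = F @ P @ F.T + Q
    Hm = H_jac(m_pred)                         # observation Jacobian at the predicted mean
    S = Hm @ P_pred @ Hm.T + R
    K = np.linalg.solve(S, Hm @ P_pred).T      # gain P_pred Hm^T S^{-1}
    m_new = m_pred + K @ (y - h(m_pred))
    P_new = (np.eye(len(m)) - K @ Hm) @ P_pred
    return m_new, P_new

# Toy model (hypothetical): mildly nonlinear drift, direct observation of the state.
f = lambda u: u + 0.1 * np.sin(u)
F_jac = lambda u: np.array([[1.0 + 0.1 * np.cos(u[0])]])
h = lambda u: u
H_jac = lambda u: np.eye(1)
m_new, P_new = ekf_step(np.array([0.5]), np.eye(1), np.array([1.0]),
                        f, F_jac, h, H_jac, 0.1 * np.eye(1), 0.5 * np.eye(1))
```

When `f` and `h` are linear the Jacobians are constant and the step coincides with the Kalman filter.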
Nonlinear (Monte-Carlo approximate): the ensemble Kalman filter. Represent the filtering distribution as
an ensemble of $N_e$ members, propagate each through the dynamics, and apply a
Kalman-style correction built from sample covariances. The ensemble itself carries no parametric form,
although the linear correction is exact only for Gaussian statistics.

$$
\begin{aligned}
\text{Ensemble:} &\quad \{\boldsymbol{u}_t^{(i)}\}_{i=1}^{N_e} \\
\text{Predict:} &\quad \boldsymbol{u}_t^{(i)} \sim p(\boldsymbol{u}_t \mid \boldsymbol{u}_{t-1}^{(i)}, \boldsymbol{\theta}, \boldsymbol{x}_t) \\
\text{Update:} &\quad \boldsymbol{u}_t^{(i)} \leftarrow \boldsymbol{u}_t^{(i)} + \mathbf{K}_t\!\left(\boldsymbol{y}_t - \mathbf{H}\boldsymbol{u}_t^{(i)}\right)
\end{aligned}
$$

where $\mathbf{K}_t$ is the ensemble Kalman gain.
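The update step can be sketched with the gain built from ensemble anomalies. Note the assumptions: this is the stochastic (perturbed-observation) variant, which adds a noise draw to $\boldsymbol{y}_t$ for each member so that the analysis spread is not underestimated; the update written above omits that perturbation. The function name `enkf_update` and the row-per-member layout of `U` are illustrative.

```python
import numpy as np

def enkf_update(U, y, H, R, rng):
    """Analysis step for a forecast ensemble U of shape (N_e, D_u)."""
    Ne = U.shape[0]
    Ua = U - U.mean(axis=0)                        # state anomalies
    Ya = Ua @ H.T                                  # anomalies mapped to observation space
    P_uy = Ua.T @ Ya / (Ne - 1)                    # sample cross-covariance cov(u, Hu)
    P_yy = Ya.T @ Ya / (Ne - 1) + R                # sample innovation covariance
    K = np.linalg.solve(P_yy, P_uy.T).T            # ensemble Kalman gain P_uy P_yy^{-1}
    eps = rng.multivariate_normal(np.zeros(len(y)), R, size=Ne)
    innov = y + eps - U @ H.T                      # perturbed-observation innovations
    return U + innov @ K.T

rng = np.random.default_rng(0)
U_prior = rng.normal(size=(2000, 2))               # forecast ensemble, roughly N(0, I)
H = np.array([[1.0, 0.0]])                         # observe the first component only
R = np.array([[0.1]])
U_post = enkf_update(U_prior, np.array([2.0]), H, R, rng)
```

For this Gaussian toy problem the exact posterior mean of the observed component is $2 / 1.1 \approx 1.82$, and the analysis ensemble mean should land close to it.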
Nonlinear (Monte-Carlo, exact in the limit $N_p \to \infty$): the particle filter. Represent the filter as
weighted particles, propagate them, reweight by the likelihood, and resample to
avoid weight degeneracy.
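The propagate, reweight, resample cycle just described can be sketched as a bootstrap particle filter step. The assumptions are labelled in the code: the callables `propagate` and `log_lik` are hypothetical user-supplied functions, resampling is multinomial, and it is triggered only when the effective sample size falls below a fraction `ess_frac` of $N_p$ (a common heuristic, not something the schema prescribes).

```python
import numpy as np

def particle_step(U, w, y, propagate, log_lik, rng, ess_frac=0.5):
    """One bootstrap particle filter step for particles U (N_p, D_u) and weights w."""
    U = propagate(U, rng)                              # predict: sample the transition
    with np.errstate(divide="ignore"):
        logw = np.log(w) + log_lik(y, U)               # update: previous weight times likelihood
    logw -= logw.max()                                 # stabilise before exponentiating
    w = np.exp(logw)
    w /= w.sum()
    if 1.0 / np.sum(w ** 2) < ess_frac * len(w):       # effective sample size too low
        idx = rng.choice(len(w), size=len(w), p=w)     # multinomial resampling
        U, w = U[idx], np.full(len(w), 1.0 / len(w))
    return U, w

# Toy problem (assumed): random walk with unit noise, unit-variance observation y = 2.
rng = np.random.default_rng(0)
Np = 5000
U = rng.normal(size=(Np, 1))                           # particles from the prior N(0, 1)
w = np.full(Np, 1.0 / Np)
propagate = lambda U, rng: U + rng.normal(size=U.shape)
log_lik = lambda y, U: -0.5 * np.sum((y - U) ** 2, axis=1)
U, w = particle_step(U, w, np.array([2.0]), propagate, log_lik, rng)
est = float(np.sum(w * U[:, 0]))                       # weighted posterior-mean estimate
```

This toy problem is linear-Gaussian, so the exact filtering mean is 4/3 and `est` can be checked against it.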
$$
\begin{aligned}
\text{Particles:} &\quad \{\boldsymbol{u}_t^{(i)}, w_t^{(i)}\}_{i=1}^{N_p} \\
\text{Predict:} &\quad \boldsymbol{u}_t^{(i)} \sim p(\boldsymbol{u}_t \mid \boldsymbol{u}_{t-1}^{(i)}, \boldsymbol{\theta}, \boldsymbol{x}_t) \\
\text{Update:} &\quad w_t^{(i)} \propto w_{t-1}^{(i)} \, p(\boldsymbol{y}_t \mid \boldsymbol{u}_t^{(i)}, \boldsymbol{\theta}, \boldsymbol{x}_t) \\
\text{Resample:} &\quad \text{draw new particles according to } \{w_t^{(i)}\}
\end{aligned}
$$

1C · Smoothing Algorithms
Offline algorithms that use the full sequence.
Kalman / RTS Smoother (linear-Gaussian, exact). Run the Kalman filter forward, then a backward pass to fold in future observations:

$$
p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T}, \boldsymbol{\theta}) =
p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:t}, \boldsymbol{x}_{1:t}, \boldsymbol{\theta})
\int \frac{p(\boldsymbol{u}_{t+1} \mid \boldsymbol{u}_t, \boldsymbol{\theta}, \boldsymbol{x}_{t+1}) \,
p(\boldsymbol{u}_{t+1} \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T}, \boldsymbol{\theta})}
{p(\boldsymbol{u}_{t+1} \mid \boldsymbol{y}_{1:t}, \boldsymbol{x}_{1:t+1}, \boldsymbol{\theta})} \,
\mathrm{d}\boldsymbol{u}_{t+1}
$$

Particle Smoother (nonlinear, Monte-Carlo). Run the particle filter forward, then a backward pass that reweights particles using future information.
Variational Smoother: see §1D below.
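For the linear-Gaussian case, the backward pass of the Kalman / RTS smoother has a closed form in terms of the filtered and one-step-predicted moments. The sketch below is illustrative and makes simplifying assumptions: no covariates (the $\mathbf{B}_{\boldsymbol{\theta}} \boldsymbol{x}_t$ term is dropped), fully observed $\boldsymbol{y}_t$, and hypothetical function names `kalman_filter` and `rts_smoother`.

```python
import numpy as np

def kalman_filter(ys, A, H, Q, R, m0, P0):
    """Forward pass; returns filtered (m_f, P_f) and predicted (m_p, P_p) moments."""
    T, D = len(ys), len(m0)
    m_f, P_f = np.empty((T, D)), np.empty((T, D, D))
    m_p, P_p = np.empty((T, D)), np.empty((T, D, D))
    m, P = m0, P0
    for t in range(T):
        m_p[t] = A @ m
        P_p[t] = A @ P @ A.T + Q
        S = H @ P_p[t] @ H.T + R
        K = np.linalg.solve(S, H @ P_p[t]).T
        m = m_p[t] + K @ (ys[t] - H @ m_p[t])
        P = (np.eye(D) - K @ H) @ P_p[t]
        m_f[t], P_f[t] = m, P
    return m_f, P_f, m_p, P_p

def rts_smoother(m_f, P_f, m_p, P_p, A):
    """Backward pass; folds future observations into each filtered estimate."""
    m_s, P_s = m_f.copy(), P_f.copy()                     # at t = T smoothing equals filtering
    for t in range(len(m_f) - 2, -1, -1):
        G = np.linalg.solve(P_p[t + 1], A @ P_f[t]).T     # smoother gain P_f A^T P_p^{-1}
        m_s[t] = m_f[t] + G @ (m_s[t + 1] - m_p[t + 1])
        P_s[t] = P_f[t] + G @ (P_s[t + 1] - P_p[t + 1]) @ G.T
    return m_s, P_s

# Toy data (assumed): scalar random walk observed with unit-variance noise.
rng = np.random.default_rng(0)
A = H = Q = R = np.eye(1)
truth = np.cumsum(rng.normal(size=30))
ys = (truth + rng.normal(size=30))[:, None]
m_f, P_f, m_p, P_p = kalman_filter(ys, A, H, Q, R, np.zeros(1), np.eye(1))
m_s, P_s = rts_smoother(m_f, P_f, m_p, P_p, A)
```

In this setting the smoothed variance is never larger than the filtered variance, and the two coincide at the final time step.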
1D · Variational Inference
For nonlinear / non-Gaussian models where filtering and smoothing are too
expensive or unavailable, introduce a variational distribution over the full
state sequence (and parameters). Here $\boldsymbol{\psi}$ is optimised once per
observed sequence, with no generalisation across sequences.
Filtering variational posterior, maintained recursively, with
$\boldsymbol{\psi}_t$ updated at each step as new $\boldsymbol{y}_t$ arrives:

$$
q(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:t}, \boldsymbol{x}_{1:t}, \boldsymbol{\psi}_t) \approx
p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:t}, \boldsymbol{x}_{1:t}, \boldsymbol{\theta})
$$

Smoothing variational posterior, with two ways to structure the family.

Factored (mean-field):

$$
q(\boldsymbol{u}_{1:T} \mid \boldsymbol{\psi}) = \prod_{t=1}^{T} q(\boldsymbol{u}_t \mid \boldsymbol{\psi}_t)
\approx p(\boldsymbol{u}_{1:T} \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T}, \boldsymbol{\theta})
$$

$$
\mathcal{L}(\boldsymbol{\psi}) =
\sum_{t=1}^{T} \mathbb{E}_{q(\boldsymbol{u}_t \mid \boldsymbol{\psi}_t)}\!\left[ \log p(\boldsymbol{y}_t \mid \boldsymbol{u}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \right]
- \sum_{t=1}^{T} \mathbb{E}_{q(\boldsymbol{u}_{t-1} \mid \boldsymbol{\psi}_{t-1})}\!\left[ D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{u}_t \mid \boldsymbol{\psi}_t) \,\|\, p(\boldsymbol{u}_t \mid \boldsymbol{u}_{t-1}, \boldsymbol{\theta}, \boldsymbol{x}_t) \,\right] \right]
$$

The expectation in the second sum averages over the previous state, on which the transition prior depends (with $q(\boldsymbol{u}_0 \mid \boldsymbol{\psi}_0)$ supplying the initial state at $t = 1$). The factored family breaks the temporal dependencies of the true smoothing
posterior, which is a strong approximation.

Structured (Markov):

$$
q(\boldsymbol{u}_{1:T} \mid \boldsymbol{\psi}) = q(\boldsymbol{u}_0 \mid \boldsymbol{\psi}_0) \prod_{t=1}^{T} q(\boldsymbol{u}_t \mid \boldsymbol{u}_{t-1}, \boldsymbol{\psi}_t)
$$

$$
\mathcal{L}(\boldsymbol{\psi}) =
\sum_{t=1}^{T} \mathbb{E}_{q(\boldsymbol{u}_{1:T} \mid \boldsymbol{\psi})}\!\left[ \log p(\boldsymbol{y}_t \mid \boldsymbol{u}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \right]
- D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{u}_{1:T} \mid \boldsymbol{\psi}) \,\|\, p(\boldsymbol{u}_{1:T} \mid \boldsymbol{\theta}, \boldsymbol{x}_{1:T}) \,\right]
$$

The structured family preserves the Markov structure of the generative model. It contains the factored family as a special case,
so it is strictly more expressive.
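A worked decomposition may help make the trajectory-level KL term concrete. Treating $\boldsymbol{u}_0$ as part of the trajectory, the chain rule for the KL divergence between two Markov chains splits it into an initial-state term plus per-step terms. Here $q(\boldsymbol{u}_{t-1} \mid \boldsymbol{\psi})$ denotes the time-$(t-1)$ marginal of the structured family, a notation assumed for this sketch:

$$
D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{u}_{0:T} \mid \boldsymbol{\psi}) \,\|\, p(\boldsymbol{u}_{0:T} \mid \boldsymbol{\theta}, \boldsymbol{x}_{0:T}) \,\right] =
D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{u}_0 \mid \boldsymbol{\psi}_0) \,\|\, p(\boldsymbol{u}_0 \mid \boldsymbol{\theta}, \boldsymbol{x}_0) \,\right]
+ \sum_{t=1}^{T} \mathbb{E}_{q(\boldsymbol{u}_{t-1} \mid \boldsymbol{\psi})}\!\left[
D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{u}_t \mid \boldsymbol{u}_{t-1}, \boldsymbol{\psi}_t) \,\|\, p(\boldsymbol{u}_t \mid \boldsymbol{u}_{t-1}, \boldsymbol{\theta}, \boldsymbol{x}_t) \,\right]
\right]
$$

Setting $q(\boldsymbol{u}_t \mid \boldsymbol{u}_{t-1}, \boldsymbol{\psi}_t) = q(\boldsymbol{u}_t \mid \boldsymbol{\psi}_t)$ recovers the per-step penalty of the factored family, so each term can be estimated with samples of the previous state only.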
Joint smoothing + parameter inference (hierarchical):

$$
q(\boldsymbol{u}_{1:T}, \boldsymbol{\theta} \mid \boldsymbol{\psi}) =
q(\boldsymbol{u}_{1:T} \mid \boldsymbol{\theta}, \boldsymbol{\psi}_u) \, q(\boldsymbol{\theta} \mid \boldsymbol{\psi}_\theta)
$$

$$
\mathcal{L}(\boldsymbol{\psi}) =
\mathbb{E}_{q(\boldsymbol{\theta} \mid \boldsymbol{\psi}_\theta)}\!\left[
\sum_{t=1}^{T} \mathbb{E}_{q(\boldsymbol{u}_{1:T} \mid \boldsymbol{\theta}, \boldsymbol{\psi}_u)}\!\left[ \log p(\boldsymbol{y}_t \mid \boldsymbol{u}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \right]
- D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{u}_{1:T} \mid \boldsymbol{\theta}, \boldsymbol{\psi}_u) \,\|\, p(\boldsymbol{u}_{1:T} \mid \boldsymbol{\theta}, \boldsymbol{x}_{1:T}) \,\right]
\right]
- D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{\theta} \mid \boldsymbol{\psi}_\theta) \,\|\, p(\boldsymbol{\theta} \mid \boldsymbol{x}_{1:T}) \,\right]
$$

1E · Amortized Inference
Train a network once on many sequences; at test time a single forward pass
over a new sequence gives the posterior, with no per-sequence optimisation.
Amortized filtering: a recurrent network (RNN, LSTM, S4, Mamba) processes
$\boldsymbol{y}_{1:t}$ sequentially and emits a distribution over
$\boldsymbol{u}_t$; causal, generalises across sequences:

$$
q(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:t}, \boldsymbol{x}_{1:t}, \boldsymbol{\psi})
$$

Amortized smoothing: an encoder that reads the full sequence (transformer,
bidirectional RNN) and emits a distribution over $\boldsymbol{u}_t$ at each step;
non-causal, uses past and future observations:

$$
q(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T}, \boldsymbol{\psi})
$$

Amortized joint (hierarchical):

$$
q(\boldsymbol{u}_{1:T}, \boldsymbol{\theta} \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T}, \boldsymbol{\psi}) =
q(\boldsymbol{u}_{1:T} \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T}, \boldsymbol{\theta}, \boldsymbol{\psi}_u) \,
q(\boldsymbol{\theta} \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T}, \boldsymbol{\psi}_\theta)
$$

$$
\mathcal{L}(\boldsymbol{\psi}) =
\mathbb{E}_{q(\boldsymbol{\theta} \mid \boldsymbol{y}, \boldsymbol{x}, \boldsymbol{\psi}_\theta)}\!\left[
\sum_{t=1}^{T} \mathbb{E}_{q(\boldsymbol{u}_t \mid \boldsymbol{y}, \boldsymbol{x}, \boldsymbol{\theta}, \boldsymbol{\psi}_u)}\!\left[ \log p(\boldsymbol{y}_t \mid \boldsymbol{u}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \right]
- D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{u}_{1:T} \mid \boldsymbol{y}, \boldsymbol{x}, \boldsymbol{\theta}, \boldsymbol{\psi}_u) \,\|\, p(\boldsymbol{u}_{1:T} \mid \boldsymbol{\theta}, \boldsymbol{x}_{1:T}) \,\right]
\right]
- D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{\theta} \mid \boldsymbol{y}, \boldsymbol{x}, \boldsymbol{\psi}_\theta) \,\|\, p(\boldsymbol{\theta} \mid \boldsymbol{x}_{1:T}) \,\right]
$$

Inside the conditioning sets of $q$ in this objective, $\boldsymbol{y}$ and $\boldsymbol{x}$ abbreviate $\boldsymbol{y}_{1:T}$ and $\boldsymbol{x}_{1:T}$.

Track 2: Emulator
Generative Model
The emulator introduces an internal latent $\boldsymbol{z}_t$ at each time step.
The transition now operates in latent space and decodes to $\boldsymbol{u}_t$.
$$
p(\boldsymbol{y}_{1:T}, \boldsymbol{u}_{1:T}, \boldsymbol{z}_{1:T}, \boldsymbol{\theta} \mid \boldsymbol{x}_{1:T}) =
p(\boldsymbol{\theta} \mid \boldsymbol{x}_{1:T}) \,
p(\boldsymbol{z}_0 \mid \boldsymbol{\theta}, \boldsymbol{x}_0) \,
p(\boldsymbol{u}_0 \mid \boldsymbol{z}_0, \boldsymbol{\theta}, \boldsymbol{x}_0)
\prod_{t=1}^{T}
p(\boldsymbol{y}_t \mid \boldsymbol{u}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \,
p(\boldsymbol{u}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \,
p(\boldsymbol{z}_t \mid \boldsymbol{z}_{t-1}, \boldsymbol{\theta}, \boldsymbol{x}_t)
$$

The latent dynamics $p(\boldsymbol{z}_t \mid \boldsymbol{z}_{t-1}, \boldsymbol{\theta}, \boldsymbol{x}_t)$
operate in the compressed space $\mathbb{R}^{D_z}$ ($D_z \ll D_u$); the decoder
$p(\boldsymbol{u}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t)$
maps back to the full field at each step.
$\boldsymbol{z}_t$ evolves via learned latent dynamics (transition in $\boldsymbol{z}$-space).
$\boldsymbol{u}_t$ is decoded from $\boldsymbol{z}_t$ at each step (no direct $\boldsymbol{u}$-to-$\boldsymbol{u}$ transition).
$\boldsymbol{y}_t$ is observed from $\boldsymbol{u}_t$ via $\mathbf{H}$.
2.0 · Emulator Training
Train the emulator on simulator output sequences $\{\boldsymbol{u}_{1:T}, \boldsymbol{x}_{1:T}, \boldsymbol{\theta}\}$;
it learns latent dynamics in $\boldsymbol{z}$-space. This introduces an
encoder $q(\boldsymbol{z}_t \mid \boldsymbol{u}_t, \boldsymbol{x}_t, \boldsymbol{\theta}, \boldsymbol{\psi}_{\mathrm{em}})$,
decoder $p(\boldsymbol{u}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t)$,
and transition $p(\boldsymbol{z}_t \mid \boldsymbol{z}_{t-1}, \boldsymbol{\theta}, \boldsymbol{x}_t)$.

$$
\mathcal{L}_{\mathrm{em}}(\boldsymbol{\theta}, \boldsymbol{\psi}_{\mathrm{em}}) =
\sum_{t=1}^{T} \mathbb{E}_{q(\boldsymbol{z}_t \mid \boldsymbol{u}_t, \boldsymbol{x}_t, \boldsymbol{\theta}, \boldsymbol{\psi}_{\mathrm{em}})}\!\left[ \log p(\boldsymbol{u}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \right]
- \sum_{t=1}^{T} \mathbb{E}_{q(\boldsymbol{z}_{t-1} \mid \boldsymbol{u}_{t-1}, \boldsymbol{x}_{t-1}, \boldsymbol{\theta}, \boldsymbol{\psi}_{\mathrm{em}})}\!\left[ D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{z}_t \mid \boldsymbol{u}_t, \boldsymbol{x}_t, \boldsymbol{\theta}, \boldsymbol{\psi}_{\mathrm{em}}) \,\|\, p(\boldsymbol{z}_t \mid \boldsymbol{z}_{t-1}, \boldsymbol{\theta}, \boldsymbol{x}_t) \,\right] \right]
\tag{25}
$$

The KL term is averaged over the encoder's distribution for $\boldsymbol{z}_{t-1}$, because the latent transition prior depends on the previous latent state.
Equation (25) is the training objective used by sequential VAEs, deep variational Bayes filters (DVBF), and recurrent state-space models (RSSM).
After training, $\boldsymbol{\theta}$ and $\boldsymbol{\psi}_{\mathrm{em}}$
are fixed, leaving a decoder, an encoder, and a latent transition.
2.1 · Emulator Uncertainty Characterization
Characterise the per-step simulator–emulator discrepancy,

$$
p(\boldsymbol{u}_{\mathrm{true},t} \mid \boldsymbol{u}_{\mathrm{em},t}, \boldsymbol{x}_t, \boldsymbol{\theta}).
$$

2A · Exact Posteriors
2B · Filtering Algorithms (latent space)
Run the recursion in $\boldsymbol{z}$-space, then decode to $\boldsymbol{u}$-space.
If the latent transition is linear-Gaussian (and the likelihood $p(\boldsymbol{y}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t)$ is as well), run the Kalman filter in
$\boldsymbol{z}$-space and decode via $p(\boldsymbol{u}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t)$.
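A latent-space filter needs a likelihood expressed in terms of $\boldsymbol{z}_t$ rather than $\boldsymbol{u}_t$. It follows by marginalising the decoded state:

$$
p(\boldsymbol{y}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) =
\int p(\boldsymbol{y}_t \mid \boldsymbol{u}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \,
p(\boldsymbol{u}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \, \mathrm{d}\boldsymbol{u}_t
$$

As an illustration only, assume a linear-Gaussian decoder $p(\boldsymbol{u}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) = \mathcal{N}(\boldsymbol{u}_t \mid \mathbf{W} \boldsymbol{z}_t + \boldsymbol{b}, \boldsymbol{\Sigma}_{\mathrm{dec}})$, where $\mathbf{W}$, $\boldsymbol{b}$, and $\boldsymbol{\Sigma}_{\mathrm{dec}}$ are hypothetical symbols introduced for this example, together with the Gaussian observation model $\mathcal{N}(\boldsymbol{y}_t \mid \mathbf{H} \boldsymbol{u}_t, \mathbf{R}_{\boldsymbol{\theta}})$. The integral is then Gaussian in closed form:

$$
p(\boldsymbol{y}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) =
\mathcal{N}\!\left(\boldsymbol{y}_t \mid \mathbf{H}(\mathbf{W} \boldsymbol{z}_t + \boldsymbol{b}), \;
\mathbf{H} \boldsymbol{\Sigma}_{\mathrm{dec}} \mathbf{H}^{\top} + \mathbf{R}_{\boldsymbol{\theta}}\right)
$$

Under these assumptions the latent Kalman filter uses the effective observation operator $\mathbf{H}\mathbf{W}$ and an observation covariance inflated by the decoder noise.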
$$
\begin{aligned}
\text{Predict:} &\quad
p(\boldsymbol{z}_t \mid \boldsymbol{y}_{1:t-1}, \boldsymbol{x}_{1:t}, \boldsymbol{\theta}) =
\int p(\boldsymbol{z}_t \mid \boldsymbol{z}_{t-1}, \boldsymbol{\theta}, \boldsymbol{x}_t) \, p(\boldsymbol{z}_{t-1} \mid \boldsymbol{y}_{1:t-1}, \boldsymbol{x}_{1:t-1}, \boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{z}_{t-1} \\
\text{Update:} &\quad
p(\boldsymbol{z}_t \mid \boldsymbol{y}_{1:t}, \boldsymbol{x}_{1:t}, \boldsymbol{\theta}) \propto
p(\boldsymbol{y}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \, p(\boldsymbol{z}_t \mid \boldsymbol{y}_{1:t-1}, \boldsymbol{x}_{1:t}, \boldsymbol{\theta}) \\
\text{Decode:} &\quad
p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:t}, \boldsymbol{x}_{1:t}, \boldsymbol{\theta}) =
\int p(\boldsymbol{u}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \, p(\boldsymbol{z}_t \mid \boldsymbol{y}_{1:t}, \boldsymbol{x}_{1:t}, \boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{z}_t
\end{aligned}
$$

The ensemble lives in $\boldsymbol{z}$-space; decode each member to get the
$\boldsymbol{u}$-space ensemble. Here $\mathrm{dec}_{\boldsymbol{\theta}}$ is the
decoder mean.
$$
\begin{aligned}
\text{Ensemble:} &\quad \{\boldsymbol{z}_t^{(i)}\}_{i=1}^{N_e} \\
\text{Predict:} &\quad \boldsymbol{z}_t^{(i)} \sim p(\boldsymbol{z}_t \mid \boldsymbol{z}_{t-1}^{(i)}, \boldsymbol{\theta}, \boldsymbol{x}_t) \\
\text{Update:} &\quad \boldsymbol{z}_t^{(i)} \leftarrow \boldsymbol{z}_t^{(i)} + \mathbf{K}_t\!\left(\boldsymbol{y}_t - \mathbf{H}\,\mathrm{dec}_{\boldsymbol{\theta}}(\boldsymbol{z}_t^{(i)})\right) \\
\text{Decode:} &\quad \boldsymbol{u}_t^{(i)} = \mathrm{dec}_{\boldsymbol{\theta}}(\boldsymbol{z}_t^{(i)})
\end{aligned}
$$

$$
\begin{aligned}
\text{Particles:} &\quad \{\boldsymbol{z}_t^{(i)}, w_t^{(i)}\}_{i=1}^{N_p} \\
\text{Predict:} &\quad \boldsymbol{z}_t^{(i)} \sim p(\boldsymbol{z}_t \mid \boldsymbol{z}_{t-1}^{(i)}, \boldsymbol{\theta}, \boldsymbol{x}_t) \\
\text{Decode:} &\quad \boldsymbol{u}_t^{(i)} = \mathrm{dec}_{\boldsymbol{\theta}}(\boldsymbol{z}_t^{(i)}) \\
\text{Update:} &\quad w_t^{(i)} \propto w_{t-1}^{(i)} \, p(\boldsymbol{y}_t \mid \boldsymbol{u}_t^{(i)}, \boldsymbol{\theta}, \boldsymbol{x}_t) \\
\text{Resample:} &\quad \text{draw new particles according to } \{w_t^{(i)}\}
\end{aligned}
$$

2C · Smoothing Algorithms (latent space)
Kalman / RTS Smoother in $\boldsymbol{z}$ (linear-Gaussian, exact): forward Kalman filter in $\boldsymbol{z}$-space, RTS backward pass, then decode the smoothed $\boldsymbol{z}_{1:T}$ to $\boldsymbol{u}_{1:T}$.
Particle Smoother in $\boldsymbol{z}$: forward particle filter in $\boldsymbol{z}$-space, backward reweighting pass, then decode the smoothed particles.
Variational Smoother: see §2D below.
2D · Variational Inference
Filtering variational posterior, updated recursively as new $\boldsymbol{y}_t$ arrives:

$$
q(\boldsymbol{z}_t \mid \boldsymbol{y}_{1:t}, \boldsymbol{x}_{1:t}, \boldsymbol{\psi}_t) \approx
p(\boldsymbol{z}_t \mid \boldsymbol{y}_{1:t}, \boldsymbol{x}_{1:t}, \boldsymbol{\theta})
$$

Smoothing variational posterior, structured (Markov):

$$
q(\boldsymbol{z}_{1:T} \mid \boldsymbol{\psi}) = q(\boldsymbol{z}_0 \mid \boldsymbol{\psi}_0) \prod_{t=1}^{T} q(\boldsymbol{z}_t \mid \boldsymbol{z}_{t-1}, \boldsymbol{\psi}_t)
$$

$$
\mathcal{L}(\boldsymbol{\psi}) =
\sum_{t=1}^{T} \mathbb{E}_{q(\boldsymbol{z}_t \mid \boldsymbol{\psi})}\!\left[ \log p(\boldsymbol{y}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \right]
- D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{z}_{1:T} \mid \boldsymbol{\psi}) \,\|\, p(\boldsymbol{z}_{1:T} \mid \boldsymbol{\theta}, \boldsymbol{x}_{1:T}) \,\right]
$$

Joint smoothing over $\boldsymbol{u}, \boldsymbol{z}, \boldsymbol{\theta}$ (hierarchical):
$$
q(\boldsymbol{u}_{1:T}, \boldsymbol{z}_{1:T}, \boldsymbol{\theta} \mid \boldsymbol{\psi}) =
\underbrace{\prod_{t=1}^{T} q(\boldsymbol{u}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{\psi}_u)}_{\text{per-step state}} \,
\underbrace{q(\boldsymbol{z}_{1:T} \mid \boldsymbol{\theta}, \boldsymbol{\psi}_z)}_{\text{latent trajectory}} \,
\underbrace{q(\boldsymbol{\theta} \mid \boldsymbol{\psi}_\theta)}_{\text{parameters}}
$$

$$
\mathcal{L}(\boldsymbol{\psi}) =
\mathbb{E}_{q(\boldsymbol{\theta} \mid \boldsymbol{\psi}_\theta)}\!\left[
\sum_{t=1}^{T} \mathbb{E}_{q(\boldsymbol{z}_t \mid \boldsymbol{\theta}, \boldsymbol{\psi}_z)}\!\left[
\mathbb{E}_{q(\boldsymbol{u}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{\psi}_u)}\!\left[ \log p(\boldsymbol{y}_t \mid \boldsymbol{u}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \right]
- D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{u}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{\psi}_u) \,\|\, p(\boldsymbol{u}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \,\right]
\right]
- D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{z}_{1:T} \mid \boldsymbol{\theta}, \boldsymbol{\psi}_z) \,\|\, p(\boldsymbol{z}_{1:T} \mid \boldsymbol{\theta}, \boldsymbol{x}_{1:T}) \,\right]
\right]
- D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{\theta} \mid \boldsymbol{\psi}_\theta) \,\|\, p(\boldsymbol{\theta} \mid \boldsymbol{x}_{1:T}) \,\right]
$$

2E · Amortized Inference

Amortized filtering in latent space: a recurrent network processes
$\boldsymbol{y}_{1:t}$ and emits a distribution over $\boldsymbol{z}_t$; causal:

$$
q(\boldsymbol{z}_t \mid \boldsymbol{y}_{1:t}, \boldsymbol{x}_{1:t}, \boldsymbol{\psi})
$$

Amortized smoothing in latent space: a bidirectional encoder reads the full
sequence; non-causal:

$$
q(\boldsymbol{z}_t \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T}, \boldsymbol{\psi})
$$

Amortized joint (hierarchical):
$$
q(\boldsymbol{u}_{1:T}, \boldsymbol{z}_{1:T}, \boldsymbol{\theta} \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T}, \boldsymbol{\psi}) =
\prod_{t=1}^{T} q(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:T}, \boldsymbol{z}_t, \boldsymbol{x}_t, \boldsymbol{\theta}, \boldsymbol{\psi}_u) \,
q(\boldsymbol{z}_{1:T} \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T}, \boldsymbol{\theta}, \boldsymbol{\psi}_z) \,
q(\boldsymbol{\theta} \mid \boldsymbol{y}_{1:T}, \boldsymbol{x}_{1:T}, \boldsymbol{\psi}_\theta)
$$

In the ELBO, unsubscripted $\boldsymbol{y}$ and $\boldsymbol{x}$ abbreviate $\boldsymbol{y}_{1:T}$ and $\boldsymbol{x}_{1:T}$:

$$
\mathcal{L}(\boldsymbol{\psi}) =
\mathbb{E}_{q(\boldsymbol{\theta} \mid \boldsymbol{y}, \boldsymbol{x}, \boldsymbol{\psi}_\theta)}\!\left[
\sum_{t=1}^{T} \mathbb{E}_{q(\boldsymbol{z}_t \mid \boldsymbol{y}, \boldsymbol{x}, \boldsymbol{\theta}, \boldsymbol{\psi}_z)}\!\left[
\mathbb{E}_{q(\boldsymbol{u}_t \mid \boldsymbol{y}, \boldsymbol{z}_t, \boldsymbol{x}_t, \boldsymbol{\theta}, \boldsymbol{\psi}_u)}\!\left[ \log p(\boldsymbol{y}_t \mid \boldsymbol{u}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \right]
- D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{u}_t \mid \boldsymbol{y}, \boldsymbol{z}_t, \boldsymbol{x}_t, \boldsymbol{\theta}, \boldsymbol{\psi}_u) \,\|\, p(\boldsymbol{u}_t \mid \boldsymbol{z}_t, \boldsymbol{\theta}, \boldsymbol{x}_t) \,\right]
\right]
- D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{z}_{1:T} \mid \boldsymbol{y}, \boldsymbol{x}, \boldsymbol{\theta}, \boldsymbol{\psi}_z) \,\|\, p(\boldsymbol{z}_{1:T} \mid \boldsymbol{\theta}, \boldsymbol{x}_{1:T}) \,\right]
\right]
- D_{\mathrm{KL}}\!\left[\, q(\boldsymbol{\theta} \mid \boldsymbol{y}, \boldsymbol{x}, \boldsymbol{\psi}_\theta) \,\|\, p(\boldsymbol{\theta} \mid \boldsymbol{x}_{1:T}) \,\right]
$$

Summary Table

Table 2: Targets and methods across both tracks. $\boldsymbol{y}, \boldsymbol{x}$ conditioning is abbreviated.
| Track | Step | Target | Method |
| --- | --- | --- | --- |
| Simulator | Exact | $p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:t})$ | intractable (nonlinear) |
| Simulator | Exact | $p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:T})$ | intractable (nonlinear) |
| Simulator | Filtering | $p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:t})$ | Kalman / EnKF / Particle |
| Simulator | Smoothing | $p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:T})$ | RTS / Particle smoother |
| Simulator | VI | $q(\boldsymbol{u}_{1:T} \mid \boldsymbol{\psi})$ | factored or structured ELBO |
| Simulator | VI + params | $q(\boldsymbol{u}_{1:T}, \boldsymbol{\theta} \mid \boldsymbol{\psi})$ | hierarchical ELBO |
| Simulator | Amortized filter | $q(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:t}, \boldsymbol{\psi})$ | recurrent network |
| Simulator | Amortized smooth | $q(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:T}, \boldsymbol{\psi})$ | bidirectional encoder |
| Simulator | Amortized joint | $q(\boldsymbol{u}_{1:T}, \boldsymbol{\theta} \mid \boldsymbol{y}, \boldsymbol{\psi})$ | hierarchical amortized ELBO |
| Emulator | Training | $q(\boldsymbol{z}_t \mid \boldsymbol{u}_t, \boldsymbol{\psi}_{\mathrm{em}})$ | sequential VAE ELBO |
| Emulator | Exact | $p(\boldsymbol{z}_t \mid \boldsymbol{y}_{1:t})$ | intractable |
| Emulator | Exact | $p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:T})$ | intractable |
| Emulator | Filtering $\boldsymbol{z}$ | $p(\boldsymbol{z}_t \mid \boldsymbol{y}_{1:t})$ | Kalman / EnKF / Particle in $\boldsymbol{z}$ |
| Emulator | Filtering $\boldsymbol{u}$ | $p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:t})$ | decode from $\boldsymbol{z}$ filter |
| Emulator | Smoothing $\boldsymbol{z}$ | $p(\boldsymbol{z}_t \mid \boldsymbol{y}_{1:T})$ | RTS / Particle smoother in $\boldsymbol{z}$ |
| Emulator | Smoothing $\boldsymbol{u}$ | $p(\boldsymbol{u}_t \mid \boldsymbol{y}_{1:T})$ | decode from $\boldsymbol{z}$ smoother |
| Emulator | VI | $q(\boldsymbol{z}_{1:T} \mid \boldsymbol{\psi})$ | structured ELBO in $\boldsymbol{z}$ |
| Emulator | VI + state | $q(\boldsymbol{u}_{1:T}, \boldsymbol{z}_{1:T} \mid \boldsymbol{\psi})$ | hierarchical ELBO |
| Emulator | VI + params | $q(\boldsymbol{u}_{1:T}, \boldsymbol{z}_{1:T}, \boldsymbol{\theta} \mid \boldsymbol{\psi})$ | full hierarchical ELBO |
| Emulator | Amortized filter | $q(\boldsymbol{z}_t \mid \boldsymbol{y}_{1:t}, \boldsymbol{\psi})$ | recurrent network in $\boldsymbol{z}$ |
| Emulator | Amortized smooth | $q(\boldsymbol{z}_t \mid \boldsymbol{y}_{1:T}, \boldsymbol{\psi})$ | bidirectional encoder in $\boldsymbol{z}$ |
| Emulator | Amortized joint | $q(\boldsymbol{u}, \boldsymbol{z}, \boldsymbol{\theta} \mid \boldsymbol{y}, \boldsymbol{x}, \boldsymbol{\psi})$ | hierarchical amortized ELBO |
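To make the "Emulator, VI" row concrete, the structured ELBO in $\boldsymbol{z}$ can be estimated by Monte Carlo. The following is a minimal sketch, not the note's implementation. It assumes a scalar, linear-Gaussian emulator with no covariates, and the coefficients `A, Q, H, R, P1`, the parameterisation of `psi`, and both function names are hypothetical choices made for illustration. Because this toy model is linear-Gaussian, a Kalman filter gives the exact log evidence, which the ELBO must not exceed.

```python
import numpy as np


def log_normal(x, mean, var):
    """Elementwise log density of a scalar Gaussian N(mean, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)


def structured_elbo(y, psi, theta, n_samples=2000, seed=0):
    """Monte Carlo ELBO for a Markov Gaussian q over a scalar latent trajectory.

    Assumed (hypothetical) model, theta = (A, Q, H, R, P1):
        z_1 ~ N(0, P1),  z_t ~ N(A z_{t-1}, Q),  y_t ~ N(H z_t, R).
    Assumed variational family, psi = (a, b, s2), each an array of length T:
        q(z_1) = N(b[0], s2[0]),  q(z_t | z_{t-1}) = N(a[t] z_{t-1} + b[t], s2[t]).
    """
    A, Q, H, R, P1 = theta
    a, b, s2 = psi
    rng = np.random.default_rng(seed)
    total = np.zeros(n_samples)
    z_prev = None
    for t in range(len(y)):
        # Ancestral sampling from q, one time step at a time.
        q_mean = b[t] if t == 0 else a[t] * z_prev + b[t]
        z = q_mean + np.sqrt(s2[t]) * rng.standard_normal(n_samples)
        p_mean, p_var = (0.0, P1) if t == 0 else (A * z_prev, Q)
        total += log_normal(y[t], H * z, R)     # log p(y_t | z_t)
        total += log_normal(z, p_mean, p_var)   # + log p(z_t | z_{t-1})
        total -= log_normal(z, q_mean, s2[t])   # - log q(z_t | z_{t-1})
        z_prev = z
    return total.mean()


def kalman_log_evidence(y, theta):
    """Exact log p(y_{1:T}) for the same scalar model via the Kalman filter."""
    A, Q, H, R, P1 = theta
    m, P, logp = 0.0, P1, 0.0
    for t, yt in enumerate(y):
        if t > 0:
            m, P = A * m, A * A * P + Q          # predict
        S = H * H * P + R                        # innovation variance
        logp += log_normal(yt, H * m, S)
        K = P * H / S                            # Kalman gain
        m, P = m + K * (yt - H * m), (1.0 - K * H) * P  # update
    return logp


# Illustrative values only (not taken from the note).
theta = (0.9, 0.1, 1.0, 0.5, 1.0)  # A, Q, H, R, P1
A, Q, H, R, P1 = theta
T = 20
rng = np.random.default_rng(1)
z_true = np.zeros(T)
z_true[0] = np.sqrt(P1) * rng.standard_normal()
for t in range(1, T):
    z_true[t] = A * z_true[t - 1] + np.sqrt(Q) * rng.standard_normal()
y = H * z_true + np.sqrt(R) * rng.standard_normal(T)

# Baseline q that copies the prior dynamics, so the KL term vanishes.
psi_prior = (np.full(T, A), np.zeros(T), np.r_[P1, np.full(T - 1, Q)])
elbo = structured_elbo(y, psi_prior, theta)
log_ev = kalman_log_evidence(y, theta)
print(f"ELBO (q = prior): {elbo:.2f}  <=  log evidence: {log_ev:.2f}")
```

Summing `log p(y_t | z_t) + log p(z_t | z_{t-1}) - log q(z_t | z_{t-1})` along sampled trajectories estimates the expected log-likelihood and the trajectory KL together, so neither term needs a closed form. In practice `psi` would be optimised to close the gap to `log_ev`. The bound is tight when $q$ equals the exact posterior, which can be checked by hand for $T = 1$.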