3DVar — Three-Dimensional Variational Assimilation

3DVar extends Optimal Interpolation to nonlinear observation operators. When $H$ is linear the posterior is Gaussian and the analysis is a single matrix expression (the BLUE). When $H$ is nonlinear that closed form is lost, so instead of solving for the posterior exactly we compute its mode, the maximum a posteriori (MAP) estimate, by iteratively minimising a variational cost (Lorenc, 1986; Talagrand, 1997).

The “3D” denotes three spatial dimensions at a single time ($T = 0$, no temporal evolution); the dynamical extension is 4DVar. As before, the code is library-agnostic: jaxtyping for shapes, einx for array ops, JAX autodiff for the tangent-linear/adjoint, and a general optimiser (optimistix) for the minimisation.

From the MAP to the Variational Cost

The MAP estimate maximises the posterior, equivalently minimising its negative log:

\boldsymbol{x}^\star = \operatorname*{arg\,max}_{\boldsymbol{x}} \, p(\boldsymbol{x} \mid \boldsymbol{y}) = \operatorname*{arg\,min}_{\boldsymbol{x}} \, J(\boldsymbol{x}), \qquad J(\boldsymbol{x}) = \tfrac{1}{2} \| \boldsymbol{x} - \boldsymbol{x}_b \|^2_{\mathbf{B}^{-1}} + \tfrac{1}{2} \| \boldsymbol{y} - H(\boldsymbol{x}) \|^2_{\mathbf{R}^{-1}}.

(1)

The two terms trade off the background (prior) against the observations (via the observation model), each weighted by its inverse covariance.

Derivation — the cost is the negative log-posterior

With a Gaussian prior $p(\boldsymbol{x}) = \mathcal{N}(\boldsymbol{x}; \boldsymbol{x}_b, \mathbf{B})$ and a Gaussian likelihood $p(\boldsymbol{y} \mid \boldsymbol{x}) = \mathcal{N}(\boldsymbol{y}; H(\boldsymbol{x}), \mathbf{R})$, Bayes’ rule gives $p(\boldsymbol{x} \mid \boldsymbol{y}) \propto p(\boldsymbol{y} \mid \boldsymbol{x}) \, p(\boldsymbol{x})$, so

\begin{aligned} \boldsymbol{x}^\star &= \operatorname*{arg\,max}_{\boldsymbol{x}} \left[ \log p(\boldsymbol{y} \mid \boldsymbol{x}) + \log p(\boldsymbol{x}) \right] \\ &= \operatorname*{arg\,max}_{\boldsymbol{x}} \left[ -\tfrac{1}{2}(\boldsymbol{y} - H(\boldsymbol{x}))^\top \mathbf{R}^{-1} (\boldsymbol{y} - H(\boldsymbol{x})) -\tfrac{1}{2}(\boldsymbol{x} - \boldsymbol{x}_b)^\top \mathbf{B}^{-1} (\boldsymbol{x} - \boldsymbol{x}_b) + \text{const} \right] \\ &= \operatorname*{arg\,min}_{\boldsymbol{x}} \, J(\boldsymbol{x}). \end{aligned}

(2)

The normalising constants do not depend on $\boldsymbol{x}$ and drop out of the $\arg\min$.
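As a quick numerical illustration of this derivation, the sketch below evaluates $J$ from Eq. (1) and the full negative log-posterior at two states and compares their difference. The toy operator `H`, the matrices `B` and `R`, and the test states are hypothetical values chosen only for this example, and dense solves are used for brevity rather than the matrix-free operators used elsewhere in this note.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import multivariate_normal as mvn

jax.config.update("jax_enable_x64", True)  # double precision for the comparison

# Hypothetical toy problem: 3 state variables, 2 nonlinear observations.
def H(x):
    return jnp.stack([x[0] * x[1], jnp.sin(x[2])])

xb = jnp.array([1.0, 2.0, 0.5])
B = jnp.array([[0.5, 0.1, 0.0], [0.1, 0.5, 0.1], [0.0, 0.1, 0.2]])
R = 0.1 * jnp.eye(2)
y = jnp.array([2.3, 0.4])

def cost(x):
    """Variational cost J(x) of Eq. (1), with dense B and R for illustration."""
    db = x - xb        # background departure
    do = y - H(x)      # observation departure
    return 0.5 * db @ jnp.linalg.solve(B, db) + 0.5 * do @ jnp.linalg.solve(R, do)

def neg_log_posterior(x):
    """-log p(y|x) - log p(x), normalising constants included."""
    return -(mvn.logpdf(y, H(x), R) + mvn.logpdf(x, xb, B))

x1 = jnp.array([0.8, 2.1, 0.3])
x2 = jnp.array([1.4, 1.7, 0.9])

# The offset between the two functions should be the same constant at any x.
offset_1 = neg_log_posterior(x1) - cost(x1)
offset_2 = neg_log_posterior(x2) - cost(x2)
```

If the derivation holds, `offset_1` and `offset_2` agree to rounding error, which is why minimising $J$ and maximising the posterior give the same $\boldsymbol{x}^\star$.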

Gradient and Hessian

Minimisation needs the gradient, and Gauss–Newton needs (an approximation of) the Hessian:

\nabla J(\boldsymbol{x}) = \mathbf{B}^{-1}(\boldsymbol{x} - \boldsymbol{x}_b) - H'(\boldsymbol{x})^\top \mathbf{R}^{-1} (\boldsymbol{y} - H(\boldsymbol{x})),

(3)

J''(\boldsymbol{x}) \approx \mathbf{B}^{-1} + H'(\boldsymbol{x})^\top \mathbf{R}^{-1} H'(\boldsymbol{x}),

(4)

where $H'(\boldsymbol{x})$ is the tangent-linear of the observation operator (its adjoint $H'(\boldsymbol{x})^\top$ comes from autodiff; see the observation-model note).
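A common sanity check for a tangent-linear/adjoint pair is the dot-product test, $\langle H'\boldsymbol{v}, \boldsymbol{w} \rangle = \langle \boldsymbol{v}, H'^\top \boldsymbol{w} \rangle$. The sketch below runs it with `jax.linearize` and `jax.vjp` on a hypothetical toy operator; the operator and the test vectors are assumptions made for illustration only.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

# Hypothetical nonlinear observation operator (3 state variables -> 2 observations).
def H(x):
    return jnp.stack([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 0.5])     # linearisation point
v = jnp.array([0.3, -0.1, 0.7])    # state-space perturbation
w = jnp.array([1.5, -2.0])         # observation-space vector

_, tangent_linear = jax.linearize(H, x)   # tangent_linear(v) = H'(x) v
_, adjoint = jax.vjp(H, x)                # adjoint(w) = (H'(x)^T w,)

Hv = tangent_linear(v)
HTw = adjoint(w)[0]

lhs = jnp.vdot(Hv, w)    # <H'v, w>
rhs = jnp.vdot(v, HTw)   # <v, H'^T w>
```

The two inner products should agree to rounding error; a mismatch would point to an inconsistent adjoint, which matters mainly when the adjoint is hand-written rather than derived by autodiff.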

Derivation — gradient and the Gauss–Newton Hessian

Differentiating the background term gives $\nabla_{\boldsymbol{x}} \tfrac{1}{2}(\boldsymbol{x} - \boldsymbol{x}_b)^\top \mathbf{B}^{-1}(\boldsymbol{x} - \boldsymbol{x}_b) = \mathbf{B}^{-1}(\boldsymbol{x} - \boldsymbol{x}_b)$. For the observation term, the chain rule through $H$ (with Jacobian $H'(\boldsymbol{x})$) gives $\nabla_{\boldsymbol{x}} \tfrac{1}{2}(\boldsymbol{y} - H(\boldsymbol{x}))^\top \mathbf{R}^{-1}(\boldsymbol{y} - H(\boldsymbol{x})) = -H'(\boldsymbol{x})^\top \mathbf{R}^{-1}(\boldsymbol{y} - H(\boldsymbol{x}))$. The two gradients sum to (3). Differentiating once more,

J''(\boldsymbol{x}) = \mathbf{B}^{-1} + H'(\boldsymbol{x})^\top \mathbf{R}^{-1} H'(\boldsymbol{x}) \; \underbrace{-\; \sum_{i} \big[\mathbf{R}^{-1} (\boldsymbol{y} - H(\boldsymbol{x}))\big]_i \, \nabla^2 H_i(\boldsymbol{x})}_{\text{dropped by Gauss–Newton}}.

(5)

The dropped term contracts the second derivative of $H$ against the residual $\boldsymbol{y} - H(\boldsymbol{x})$. It vanishes identically when $H$ is linear, and it is small whenever the residual is small (as it is near the MAP when the observations are well fitted) or $H$ is only mildly nonlinear. Gauss–Newton therefore discards it, leaving the positive-definite approximation (4).
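The gradient (3) and the Hessian split in (5) can be checked against automatic differentiation. The sketch below does this on a hypothetical toy problem (the operator, covariances and evaluation point are illustrative assumptions), forming dense Jacobians purely for the comparison.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

# Hypothetical toy problem, chosen only to exercise the formulas.
def H(x):
    return jnp.stack([x[0] * x[1], jnp.sin(x[2])])

xb = jnp.array([1.0, 2.0, 0.5])
B = jnp.array([[0.5, 0.1, 0.0], [0.1, 0.5, 0.1], [0.0, 0.1, 0.2]])
R = 0.1 * jnp.eye(2)
y = jnp.array([2.3, 0.4])
B_inv, R_inv = jnp.linalg.inv(B), jnp.linalg.inv(R)

def cost(x):
    db, do = x - xb, y - H(x)
    return 0.5 * db @ B_inv @ db + 0.5 * do @ R_inv @ do

x = jnp.array([0.8, 2.1, 0.3])   # arbitrary evaluation point
Hp = jax.jacfwd(H)(x)            # H'(x), dense (M, N) Jacobian
resid = y - H(x)

grad_eq3 = B_inv @ (x - xb) - Hp.T @ R_inv @ resid   # Eq. (3)
hess_gn = B_inv + Hp.T @ R_inv @ Hp                  # Eq. (4)

# Second-order term of Eq. (5): second derivatives of H contracted with R^{-1} residual.
d2H = jax.hessian(H)(x)                              # shape (M, N, N)
dropped = -jnp.einsum("i,ijk->jk", R_inv @ resid, d2H)

grad_ad = jax.grad(cost)(x)      # autodiff reference gradient
hess_ad = jax.hessian(cost)(x)   # autodiff reference (full) Hessian
```

Here `grad_eq3` should match `grad_ad`, and `hess_gn + dropped` should match `hess_ad`; the size of `dropped` gives a feel for how much curvature Gauss–Newton ignores away from the MAP.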

The Gauss–Newton Inner Loop = Iterated BLUE

Each Gauss–Newton iteration solves the normal equations for an increment $\delta\boldsymbol{x}$ and updates the iterate:

\big( \mathbf{B}^{-1} + H_k'^\top \mathbf{R}^{-1} H_k' \big) \, \delta\boldsymbol{x} = -\nabla J(\boldsymbol{x}_k), \qquad \boldsymbol{x}_{k+1} = \boldsymbol{x}_k + \delta\boldsymbol{x},

(6)

where $H_k' = H'(\boldsymbol{x}_k)$. Substituting (3), the right-hand side is $\mathbf{B}^{-1}(\boldsymbol{x}_b - \boldsymbol{x}_k) + H_k'^\top \mathbf{R}^{-1} \boldsymbol{d}_k$ with innovation $\boldsymbol{d}_k = \boldsymbol{y} - H(\boldsymbol{x}_k)$, so the increment is exactly the minimiser of the cost linearised about $\boldsymbol{x}_k$, which is a linear-Gaussian problem. In other words, 3DVar is a sequence of BLUE analyses, each one re-linearising $H$ about the current iterate.

Each inner solve uses the same matrix-free machinery as OI — never forming $\mathbf{B}$ or the Hessian, only their action on vectors:

import jax
from jax.scipy.sparse.linalg import cg
from jaxtyping import Array, Float

def gauss_newton_step(
    xk: Float[Array, "N"],
    xb: Float[Array, "N"],
    y:  Float[Array, "M"],
    H,             # nonlinear observation operator  x -> H(x)
    apply_B_inv,   # v -> B⁻¹ v
    apply_R_inv,   # v -> R⁻¹ v
) -> Float[Array, "N"]:
    Hk, Hk_lin = jax.linearize(H, xk)              # Hk = H(xk);  Hk_lin(v) = H'_k v
    Hk_adj     = jax.linear_transpose(Hk_lin, xk)  # (H'_k)ᵀ

    grad = apply_B_inv(xk - xb) - Hk_adj(apply_R_inv(y - Hk))[0]   # ∇J(xk)
    def hess(v: Float[Array, "N"]) -> Float[Array, "N"]:          # B⁻¹ + H'_kᵀ R⁻¹ H'_k
        return apply_B_inv(v) + Hk_adj(apply_R_inv(Hk_lin(v)))[0]

    dx, _ = cg(hess, -grad)                                       # solve normal equations
    return xk + dx
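To make the "iterated BLUE" reading concrete, the sketch below uses a dense stand-in for `gauss_newton_step` (explicit matrices and a direct solve instead of matrix-free operators and CG) and a linear observation operator. Under those assumptions a single step, from any starting point, should land on the BLUE analysis. The matrices and vectors are hypothetical toy values.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def dense_gauss_newton_step(xk, xb, y, H, B, R):
    """One Gauss-Newton step, Eq. (6), with dense matrices (illustration only)."""
    Hk = H(xk)
    Jk = jax.jacfwd(H)(xk)                        # H'_k as an (M, N) matrix
    B_inv, R_inv = jnp.linalg.inv(B), jnp.linalg.inv(R)
    grad = B_inv @ (xk - xb) - Jk.T @ R_inv @ (y - Hk)
    hess = B_inv + Jk.T @ R_inv @ Jk
    return xk + jnp.linalg.solve(hess, -grad)

# Hypothetical linear problem: 3 state variables, 2 observations.
H_mat = jnp.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])

def H_lin(x):
    return H_mat @ x

xb = jnp.array([1.0, 2.0, 0.5])
B = jnp.array([[0.5, 0.1, 0.0], [0.1, 0.5, 0.1], [0.0, 0.1, 0.2]])
R = 0.1 * jnp.eye(2)
y = jnp.array([1.4, 2.2])

# BLUE / OI analysis in gain form: xa = xb + K (y - H xb).
K = B @ H_mat.T @ jnp.linalg.inv(H_mat @ B @ H_mat.T + R)
x_blue = xb + K @ (y - H_mat @ xb)

x_gn_from_background = dense_gauss_newton_step(xb, xb, y, H_lin, B, R)
x_gn_from_elsewhere = dense_gauss_newton_step(xb + 1.0, xb, y, H_lin, B, R)
```

Both Gauss–Newton results should coincide with `x_blue`. With a nonlinear $H$ the same step would be repeated, re-linearising each time, until the increment is negligible.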

Preconditioning: the control-variable transform

The Hessian (4) contains $\mathbf{B}^{-1}$, which is badly conditioned for smooth priors. Operational 3DVar minimises in a whitened control variable $\boldsymbol{\chi}$ defined by $\boldsymbol{x} = \boldsymbol{x}_b + \mathbf{B}^{1/2}\boldsymbol{\chi}$ (Bannister, 2008), which turns the background term into an identity and preconditions the problem:

J(\boldsymbol{\chi}) = \tfrac{1}{2}\|\boldsymbol{\chi}\|^2 + \tfrac{1}{2}\|\boldsymbol{y} - H(\boldsymbol{x}_b + \mathbf{B}^{1/2}\boldsymbol{\chi})\|^2_{\mathbf{R}^{-1}}.

(7)

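The background contribution to the Hessian becomes $\mathbf{I}$, so the Gauss–Newton Hessian in $\boldsymbol{\chi}$ is $\mathbf{I} + \mathbf{B}^{\top/2} H'^\top \mathbf{R}^{-1} H' \mathbf{B}^{1/2}$, whose eigenvalues are bounded below by 1. The conjugate-gradient inner loop then typically converges in far fewer iterations.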
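The effect of the transform on conditioning can be seen on a small dense example. The sketch below assumes a hypothetical exponential-kernel $\mathbf{B}$ on a 30-point grid, a linear operator observing every third point, and $\mathbf{R} = \mathbf{I}$; the Cholesky factor stands in for $\mathbf{B}^{1/2}$. None of these choices come from an operational system.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

N = 30
idx = jnp.arange(N)
# Hypothetical background covariance: exponential kernel, length scale 5 grid points.
B = jnp.exp(-jnp.abs(idx[:, None] - idx[None, :]) / 5.0)
H_mat = jnp.eye(N)[::3]                  # observe every third grid point (M = 10)
R_inv = jnp.eye(H_mat.shape[0])          # R = I for simplicity

L = jnp.linalg.cholesky(B)               # B = L L^T, so L acts as B^{1/2}

hess_x = jnp.linalg.inv(B) + H_mat.T @ R_inv @ H_mat            # Hessian in x
hess_chi = jnp.eye(N) + L.T @ H_mat.T @ R_inv @ H_mat @ L       # Hessian in chi

def condition_number(A):
    eigvals = jnp.linalg.eigvalsh(A)     # ascending eigenvalues of a symmetric matrix
    return eigvals[-1] / eigvals[0]

cond_x = condition_number(hess_x)
cond_chi = condition_number(hess_chi)
```

For this setup `cond_chi` comes out well below `cond_x`, and the gap generally widens for smoother kernels. The two Hessians are related by `hess_chi = L.T @ hess_x @ L`, so the minimiser is unchanged and only the geometry seen by the solver differs.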

Implementation

The Gauss–Newton form is a nonlinear least-squares problem: stack the whitened residuals so that $\tfrac{1}{2}\|\boldsymbol{r}(\boldsymbol{x})\|^2 = J(\boldsymbol{x})$, and hand it to a least-squares solver.

import optimistix as optx
import jax.numpy as jnp
from jaxtyping import Array, Float

def threedvar(
    xb, y, H,
    whiten_B,   # v -> B^{-1/2} v
    whiten_R,   # v -> R^{-1/2} v
    *,
    # Tolerances this tight sit below float32 resolution; they assume 64-bit mode (jax_enable_x64).
    solver=optx.GaussNewton(rtol=1e-8, atol=1e-8),
):
    def residual(x: Float[Array, "N"], args) -> Float[Array, "N+M"]:
        return jnp.concatenate([whiten_B(x - xb), whiten_R(y - H(x))])
    sol = optx.least_squares(residual, solver, xb)    # ½‖r‖² = J(x); warm-start at the background
    return sol.value                                  # MAP estimate x*
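One possible way to build the whitening operators is from Cholesky factors, as sketched below. The helper `make_whitener` is a hypothetical name, and it uses $\mathbf{L}^{-1}$ (with $\mathbf{C} = \mathbf{L}\mathbf{L}^\top$) as the inverse square root, which is one valid choice among several. The sketch checks that the stacked residual reproduces $J$ on illustrative toy values.

```python
import jax
import jax.numpy as jnp
from jax.scipy.linalg import solve_triangular

jax.config.update("jax_enable_x64", True)

def make_whitener(C):
    """Return v -> L^{-1} v with C = L L^T, so that ||L^{-1} v||^2 = v^T C^{-1} v."""
    L = jnp.linalg.cholesky(C)
    return lambda v: solve_triangular(L, v, lower=True)

# Hypothetical toy problem.
def H(x):
    return jnp.stack([x[0] * x[1], jnp.sin(x[2])])

xb = jnp.array([1.0, 2.0, 0.5])
B = jnp.array([[0.5, 0.1, 0.0], [0.1, 0.5, 0.1], [0.0, 0.1, 0.2]])
R = 0.1 * jnp.eye(2)
y = jnp.array([2.3, 0.4])

whiten_B = make_whitener(B)
whiten_R = make_whitener(R)

def residual(x):
    # Same stacking as in threedvar: background part first, observation part second.
    return jnp.concatenate([whiten_B(x - xb), whiten_R(y - H(x))])

def cost(x):
    db, do = x - xb, y - H(x)
    return 0.5 * db @ jnp.linalg.solve(B, db) + 0.5 * do @ jnp.linalg.solve(R, do)

x = jnp.array([0.8, 2.1, 0.3])
r = residual(x)
half_sq_norm = 0.5 * r @ r      # should equal J(x)
```

Operators built this way could be passed as `whiten_B` and `whiten_R` to `threedvar`; for large problems the dense Cholesky factor would be replaced by whatever structured square root the covariance model provides.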

Table 1: Choosing a minimiser by cost-function geometry.

| Minimiser | Use when |
| --- | --- |
| `GaussNewton` | cost is approximately quadratic (least-squares); the default |
| `LevenbergMarquardt` | $H$ is highly nonlinear, or Gauss–Newton overshoots |
| `BFGS` | general-purpose, no least-squares structure assumed |
| `NonlinearCG` | memory-constrained problems |

Convergence typically takes tens of iterations for well-conditioned problems. To differentiate through the optimum (e.g. to learn $\mathbf{B}$ or $\mathbf{R}$), use an implicit adjoint (`optx.ImplicitAdjoint`): the implicit function theorem gives the gradient at the solution without unrolling the optimiser, so memory does not grow with the number of iterations.
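For reference, the implicit-function-theorem gradient can be written out as follows. Here $\theta$ stands for whichever parameters of $\mathbf{B}$ or $\mathbf{R}$ are being learned and $\ell$ is a hypothetical scalar training loss that depends on $\theta$ only through the MAP; both symbols are introduced for this illustration and do not appear elsewhere in this note.

```latex
\nabla_{\boldsymbol{x}} J\big(\boldsymbol{x}^\star(\theta), \theta\big) = \boldsymbol{0}
\;\Longrightarrow\;
\frac{\partial \boldsymbol{x}^\star}{\partial \theta}
= -\big[\nabla^2_{\boldsymbol{x}\boldsymbol{x}} J\big]^{-1} \nabla^2_{\boldsymbol{x}\theta} J,
\qquad
\frac{\mathrm{d}\ell}{\mathrm{d}\theta}
= -\big(\nabla^2_{\boldsymbol{x}\theta} J\big)^\top \boldsymbol{\lambda},
\quad \text{where} \quad
\big[\nabla^2_{\boldsymbol{x}\boldsymbol{x}} J\big] \, \boldsymbol{\lambda}
= \nabla_{\boldsymbol{x}} \ell(\boldsymbol{x}^\star).
```

All derivatives are evaluated at $\boldsymbol{x}^\star$. The gradient costs one extra linear solve with the Hessian at the optimum, which can reuse the same matrix-free Hessian action as the inner loop.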

Posterior via the Laplace Approximation

For a Gaussian likelihood near the MAP, a second-order (Laplace) expansion of $J$ gives a Gaussian posterior whose covariance is the inverse Gauss–Newton Hessian:

p(\boldsymbol{x} \mid \boldsymbol{y}) \approx \mathcal{N}(\boldsymbol{x}; \boldsymbol{x}^\star, \mathbf{P}^\star), \qquad \mathbf{P}^\star = \big( \mathbf{B}^{-1} + H'(\boldsymbol{x}^\star)^\top \mathbf{R}^{-1} H'(\boldsymbol{x}^\star) \big)^{-1} \approx \big( J''(\boldsymbol{x}^\star) \big)^{-1}.

(8)

This is exactly the OI posterior covariance with $H$ linearised at the MAP. Sampling and marginal variances need only matvecs with $(\mathbf{P}^\star)^{1/2}$, so nothing is materialised. For strongly nonlinear operators the Laplace approximation can under-report uncertainty; reach for an ensemble covariance or an amortised posterior for better calibration.
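A small dense sketch of the Laplace posterior is given below. It locates the MAP with a fixed number of dense Gauss–Newton iterations, factorises the Gauss–Newton Hessian, and draws a sample through a triangular solve. The toy operator, covariances and iteration count are assumptions made for illustration; at scale the factor would be applied matrix-free rather than formed.

```python
import jax
import jax.numpy as jnp
from jax.scipy.linalg import solve_triangular

jax.config.update("jax_enable_x64", True)

# Hypothetical toy problem.
def H(x):
    return jnp.stack([x[0] * x[1], jnp.sin(x[2])])

xb = jnp.array([1.0, 2.0, 0.5])
B = jnp.array([[0.5, 0.1, 0.0], [0.1, 0.5, 0.1], [0.0, 0.1, 0.2]])
R = 0.1 * jnp.eye(2)
y = jnp.array([2.3, 0.4])
B_inv, R_inv = jnp.linalg.inv(B), jnp.linalg.inv(R)

def grad_and_gn_hessian(x):
    Hp = jax.jacfwd(H)(x)
    grad = B_inv @ (x - xb) - Hp.T @ R_inv @ (y - H(x))   # Eq. (3)
    hess = B_inv + Hp.T @ R_inv @ Hp                      # Eq. (4)
    return grad, hess

# Dense Gauss-Newton iterations (fixed count for brevity, no convergence test).
x_map = xb
for _ in range(20):
    g, A = grad_and_gn_hessian(x_map)
    x_map = x_map - jnp.linalg.solve(A, g)

g_map, A_map = grad_and_gn_hessian(x_map)
P_star = jnp.linalg.inv(A_map)            # Laplace covariance, Eq. (8)

L = jnp.linalg.cholesky(A_map)            # A = L L^T, hence P* = L^{-T} L^{-1}

def sample(key):
    z = jax.random.normal(key, xb.shape, dtype=xb.dtype)
    return x_map + solve_triangular(L.T, z, lower=False)  # L^{-T} z ~ N(0, P*)

draw = sample(jax.random.PRNGKey(0))

S = solve_triangular(L.T, jnp.eye(3), lower=False)        # S = L^{-T}, a square root of P*
marginal_var = jnp.sum(S**2, axis=1)                      # diag(S S^T) = diag(P*)
```

The marginal variances should not exceed the corresponding diagonal entries of `B`, reflecting the information added by the observations.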

When 3DVar Is Appropriate

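Use 3DVar when…

$H$ is nonlinear (the reason to go beyond OI).
The problem is a single timestep with no dynamics.
The posterior is unimodal and well approximated as Gaussian around the MAP.
The background and observation errors are Gaussian, with covariances $\mathbf{B}$ and $\mathbf{R}$.

Reach for something else when…

$H$ is linear: OI already gives the analysis in closed form.
The state evolves in time: 4DVar is the dynamical extension.
$H$ is strongly nonlinear and calibrated uncertainty matters: use an ensemble covariance or an amortised posterior.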

Typical cases: snapshot inversions (a single satellite overpass), static-field estimation, multi-instrument fusion at one time, single-overpass methane source attribution.

References

Lorenc, A. C. (1986). Analysis Methods for Numerical Weather Prediction. Quarterly Journal of the Royal Meteorological Society, 112(474), 1177–1194. 10.1002/qj.49711247414
Talagrand, O. (1997). Assimilation of Observations, an Introduction. Journal of the Meteorological Society of Japan, 75(1B), 191–209. 10.2151/jmsj1965.75.1B_191
Bannister, R. N. (2008). A Review of Forecast Error Covariance Statistics in Atmospheric Variational Data Assimilation. I: Characteristics and Measurements of Forecast Error Covariances. Quarterly Journal of the Royal Meteorological Society, 134(637), 1951–1970. 10.1002/qj.339