Bi-Level Optimization

How to think about modern 4DVar formulations

CNRS
MEOM

In earlier sections, we identified some estimation problems that we encounter: for example, state estimation and parameter estimation. But how can we estimate the state and the parameters simultaneously? Plug-and-Play priors are one way to handle this: we can simply pre-train our model on historical observations and then apply it zero-shot.

This is where bi-level optimization comes into play: there is a substantial body of theory on how to learn the state and the parameters jointly.


Formulation

$$
\begin{aligned}
\text{Outer-Level Objective}: && \boldsymbol{\theta}^* &= \underset{\boldsymbol{\theta}}{\text{argmin }} \mathcal{L}(\boldsymbol{\theta},\mathbf{x}^*(\boldsymbol{\theta})) \\
\text{Inner-Level Objective}: && \mathbf{x}^*(\boldsymbol{\theta}) &= \underset{\mathbf{x}}{\text{argmin }} \mathcal{U}(\mathbf{x},\boldsymbol{\theta})
\end{aligned}
$$
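To make this concrete, here is a minimal sketch in JAX of the two objectives for a toy problem. The quadratic inner objective, the observation vector `y_obs`, and the held-out truth `x_true` are all hypothetical stand-ins, not part of any real 4DVar setup:

```python
import jax
import jax.numpy as jnp

# Inner objective U(x, theta): a data-fidelity term plus a
# theta-weighted quadratic prior (hypothetical stand-in for a 4DVar cost).
def inner_objective(x, theta, y_obs):
    data_fit = 0.5 * jnp.sum((x - y_obs) ** 2)
    prior = 0.5 * theta * jnp.sum(x ** 2)
    return data_fit + prior

# Outer objective L(theta, x*(theta)): how well the inner solution
# matches some held-out "truth" (hypothetical).
def outer_objective(theta, x_star, x_true):
    return 0.5 * jnp.sum((x_star - x_true) ** 2)
```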

Argmin Differentiation

$$
\frac{d \mathcal{L}(\boldsymbol{\theta},\mathbf{x}^*(\boldsymbol{\theta}))}{d \boldsymbol{\theta}} = \frac{\partial \mathcal{L}(\boldsymbol{\theta},\mathbf{x}^*(\boldsymbol{\theta}))}{\partial \boldsymbol{\theta}} + \left(\frac{\partial \mathbf{x}^*(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right)^\top \frac{\partial \mathcal{L}(\boldsymbol{\theta},\mathbf{x}^*(\boldsymbol{\theta}))}{\partial \mathbf{x}}
$$

So the question is: how do we get the derivative of the inner solution with respect to the parameters $\boldsymbol{\theta}$?

$$
\frac{\partial \mathbf{x}^*(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
$$

Unrolling

We assume that the iterate at the end of the unrolling approximates the solution of the minimization problem:

$$
x_k(\theta) \approx x^*(\theta)
$$

We define a loss function that evaluates the outer objective, $f$, at this iterate:

$$
\mathcal{L}_k(\theta) := f(\theta, x_k(\theta))
$$

We can now define the unrolling process. Choose an initial value, $x_0(\theta)$, and define the unrolling step as

$$
x_{k+1}(\theta) = x_k(\theta) - \lambda \nabla_x\mathcal{U}(\theta,x_k(\theta))
$$

Note: we can choose whatever algorithm we want for this inner update. For example, we can choose SGD or even Adam.
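As a minimal sketch, continuing the hypothetical toy problem above, we can unroll $k$ gradient-descent steps in JAX and backpropagate straight through them with `jax.grad`:

```python
def unroll(theta, x0, y_obs, k=50, lr=0.1):
    # k explicit gradient-descent steps on the inner objective;
    # lax.scan keeps the loop jit-friendly and differentiable.
    def step(x, _):
        x_new = x - lr * jax.grad(inner_objective)(x, theta, y_obs)
        return x_new, None
    x_k, _ = jax.lax.scan(step, x0, None, length=k)
    return x_k

def outer_loss_unrolled(theta, x0, y_obs, x_true):
    x_k = unroll(theta, x0, y_obs)  # x_k(theta) ~ x*(theta)
    return outer_objective(theta, x_k, x_true)

# Differentiate through the entire unrolled trajectory:
hypergrad = jax.grad(outer_loss_unrolled)(
    0.5, jnp.zeros(3), jnp.ones(3), 0.8 * jnp.ones(3))
```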

Similarly, we can also choose an algorithm for how we update the parameters:

$$
\theta_{n+1} = \theta_n - \mu \nabla_\theta \mathcal{L}_k(\theta_n)
$$

Note: we assume that the optimal parameter for the unrolled problem, $\theta_k^*$, is a good choice for the true parameter-estimation problem, $\theta_n$.
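Putting the two levels together, a sketch of the full loop under the same hypothetical setup (plain gradient descent with step size $\mu =$ `1e-2`):

```python
theta = 0.5                        # initial parameter guess
x0 = jnp.zeros(3)
y_obs, x_true = jnp.ones(3), 0.8 * jnp.ones(3)

for n in range(100):
    # One outer gradient step on L_k(theta); the inner unroll
    # happens inside outer_loss_unrolled.
    g = jax.grad(outer_loss_unrolled)(theta, x0, y_obs, x_true)
    theta = theta - 1e-2 * g
```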


Implicit Differentiation

We focus on the argmin differentiation problem

$$
\mathbf{x}^*(\boldsymbol{\theta}) = \underset{\mathbf{x}}{\text{argmin }} \mathcal{U}(\mathbf{x},\boldsymbol{\theta})
$$

Assumptions:

  • Strongly convex in $x$
  • Smooth

Implicit Function Theorem

This states that $x^*(\theta)$ is the unique solution of

$$
\nabla_x \mathcal{U}(\theta, x^*(\theta)) = 0
$$

Note: unrolling just differentiates through the iterates $x_k(\theta)$; implicit differentiation instead works directly with the condition that characterizes the solution itself.

Implications: This holds for all $\theta$'s, so we can differentiate both sides with respect to $\theta$!

Goal: estimate $\nabla_\theta x^*(\theta)$.

Result: A Linear System!

$$
\partial_x \partial_\theta \mathcal{U}(\theta,x^*(\theta)) + \partial^2_x\mathcal{U}(\theta,x^*(\theta))\, \nabla_\theta x^*(\theta) = 0
$$

We can write this compactly, defining $A(\theta) := \partial^2_x\mathcal{U}(\theta,x^*(\theta))$ and $B(\theta) := \partial_x \partial_\theta \mathcal{U}(\theta,x^*(\theta))$:

$$
B(\theta) + A(\theta)\, \nabla_\theta x^*(\theta) = 0
$$
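For a problem as small as the toy example above, we can form $A$ and $B$ explicitly and solve this linear system directly. This is a sketch for illustration only; for realistic state dimensions, $A$ would never be materialized:

```python
def implicit_jacobian(theta, x_star, y_obs):
    # A(theta): Hessian of U wrt x at the solution, shape [Dx, Dx].
    A = jax.hessian(inner_objective, argnums=0)(x_star, theta, y_obs)
    # B(theta): derivative of grad_x U wrt theta; theta is scalar here,
    # so B has shape [Dx] ([Dx, Dtheta] in general).
    B = jax.jacobian(jax.grad(inner_objective, argnums=0), argnums=1)(
        x_star, theta, y_obs)
    # Solve A @ dxdtheta = -B for the Jacobian of x* wrt theta.
    return jnp.linalg.solve(A, -B)
```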

In Theory: we can find the Jacobian $\nabla_\theta x^*(\theta)$ by solving this linear system.

In Practice: we don't need the full Hessian. We just need Hessian-vector products! Looking at the sizes, we notice that

$$
A(\theta) \in \mathbb{R}^{D_x \times D_x}, \qquad B(\theta) \in \mathbb{R}^{D_x \times D_\theta}
$$

so $A(\theta)$ is far too large to form or invert when the state is high-dimensional.

Observe: let's look at the original loss function. Using the chain rule, we get:

$$
\nabla \mathcal{L}(\theta) = \partial_\theta f(\theta, x^*(\theta)) + \nabla_\theta x^*(\theta)^\top\, \partial_x f(\theta, x^*(\theta))
$$

And now, let's look at the linear system we want to solve:

$$
\nabla_\theta x^*(\theta) = - \left[A(\theta)\right]^{-1}B(\theta)
$$

which is awful to compute directly, since it requires inverting the $D_x \times D_x$ matrix $A(\theta)$. Plugging this back into the gradient, we get:

$$
\nabla \mathcal{L}(\theta) = \partial_\theta f(\theta, x^*(\theta)) - \left(\left[A(\theta)\right]^{-1}B(\theta)\right)^\top \partial_x f(\theta, x^*(\theta))
$$

However, looking at the sizes, the order of operations matters: $A(\theta)$ is $[D_x \times D_x]$, $B(\theta)$ is $[D_x \times D_\theta]$, and $\partial_x f$ is a $[D_x]$ vector.

This is hard: forming $[A(\theta)]^{-1}B(\theta)$ first means solving a $[D_x \times D_x]$ system with $D_\theta$ right-hand sides.

This is easy: since $A(\theta)$ is a symmetric Hessian, we can instead solve $A(\theta)\,v = \partial_x f$ first, a single system with one $[D_x]$ vector.

The remaining product, $B(\theta)^\top v$, is simply a vjp.


Note: we can also solve the linear system iteratively (e.g., with gradient descent or conjugate gradients) using Hessian-vector products instead of explicit Hessians.
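A matrix-free sketch in JAX, continuing the hypothetical toy problem: solve $A(\theta)\,v = \partial_x f$ by conjugate gradients using only Hessian-vector products, then finish with a single vjp against $B(\theta)$:

```python
from jax.scipy.sparse.linalg import cg

def hypergradient(theta, x_star, y_obs, x_true):
    # Right-hand side: gradient of the outer objective wrt x at x*.
    dLdx = jax.grad(outer_objective, argnums=1)(theta, x_star, x_true)

    # Hessian-vector product v -> A(theta) @ v, never forming A.
    grad_x = lambda x: jax.grad(inner_objective)(x, theta, y_obs)
    hvp = lambda v: jax.jvp(grad_x, (x_star,), (v,))[1]

    # Solve A v = dL/dx with conjugate gradients (A is symmetric).
    v, _ = cg(hvp, dLdx)

    # vjp against B(theta) = d_theta grad_x U, pulling v back to theta.
    _, vjp_fn = jax.vjp(
        lambda t: jax.grad(inner_objective)(x_star, t, y_obs), theta)
    (Btv,) = vjp_fn(v)

    # Total hypergradient: d_theta f - B^T A^{-1} d_x f.
    dLdtheta = jax.grad(outer_objective, argnums=0)(theta, x_star, x_true)
    return dLdtheta - Btv
```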


Computational Cost

The cost is almost the same in terms of computations.

| Unrolling | Implicit Differentiation |
| --- | --- |
| $k$ steps forward (unrolling) | $k$ steps of optimization to find $x^*(\theta)$ |
| $k$ steps backward (backprop through the unroll) | $k$ steps for the linear-system solve ($Ax - b$) to get $\nabla_\theta x^*(\theta)$ |

The memory cost, however, favors implicit differentiation: we don't have to store the intermediate iterates of the unroll for backpropagation.

Approximate Soln or Gradient

$$
\nabla\mathcal{L}_k(\theta) \approx \nabla\mathcal{L}(\theta)
$$

Do we approximate the solution (unrolling) or approximate the gradient (implicit differentiation)?

Warm Starts

We can warm-start the linear-system solve in

$$
\nabla_\theta x^*(\theta) = -[A(\theta)]^{-1}B(\theta)
$$

by starting from an already-good solution, e.g. the one from the previous outer iteration. (This is a warm start rather than preconditioning; preconditioning transforms the system itself.)
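A small sketch, reusing the `cg`-based solver above: `jax.scipy.sparse.linalg.cg` accepts an initial guess `x0`, so the previous outer iteration's solution `v_prev` (hypothetical) can seed the next solve:

```python
def solve_warm(hvp, dLdx, v_prev):
    # Seeding CG with the previous solution usually means far fewer
    # iterations, since theta changes little between outer steps.
    v, _ = cg(hvp, dLdx, x0=v_prev)
    return v
```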

Strongly Convex Solution?

Not really... (Michael's work suggests that overparameterized systems converge faster!)