Learning to Learn

How can we use nested learning schemes to speed up optimization?

CNRS
MEOM

General Formulation

Whirlwind Tour

Optimization-Based


LSTM Meta-Optimizer

We will pay special attention to the LSTM meta-optimizer.

In many cases, we need to find the best state given the available data (and parameters). Most gradient update schemes look like the following, where the update rule itself is fixed (i.e. hand-designed).

To find the optimal solution to this problem, we can write the iterative update as:

$$
\boldsymbol{z}^{(k+1)} = \boldsymbol{z}^{(k)} + \boldsymbol{g}_k
$$

where $\boldsymbol{g}_k$ is the output of a generalized gradient operator

$$
[\boldsymbol{g}_k, \boldsymbol{h}_{k+1}] = \boldsymbol{g}\left(\boldsymbol{\nabla}_{\boldsymbol{z}}\boldsymbol{J}, \boldsymbol{h}_k, k; \boldsymbol{\phi}\right)
$$

where $k$ is the iteration, $\boldsymbol{\phi}$ are the parameters of the gradient operator, and $\boldsymbol{h}$ is the hidden state.
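
To make this concrete, below is a minimal sketch (assuming JAX) of such a learned update, with the gradient operator implemented as a single hand-written LSTM cell plus a linear readout; the iteration index $k$ is folded into the hidden state and omitted from the call for brevity. The names `init_phi`, `gradient_operator`, and `inner_step` are illustrative assumptions, not from the source.

```python
# Sketch: a learned gradient operator g(∇_z J, h_k; φ) as one LSTM cell in JAX.
import jax
import jax.numpy as jnp


def init_phi(key, dim, hidden=20):
    """Initialise the meta-optimizer parameters φ (one LSTM cell + linear readout)."""
    k1, k2 = jax.random.split(key)
    return {
        "W": jax.random.normal(k1, (4 * hidden, dim + hidden)) * 0.01,
        "b": jnp.zeros((4 * hidden,)),
        "W_out": jax.random.normal(k2, (dim, hidden)) * 0.01,
    }


def gradient_operator(grad, state, phi):
    """[g_k, h_{k+1}] = g(∇_z J, h_k; φ): one LSTM step followed by a linear readout."""
    h, c = state
    x = jnp.concatenate([grad, h])
    gates = phi["W"] @ x + phi["b"]
    i, f, o, u = jnp.split(gates, 4)
    c = jax.nn.sigmoid(f) * c + jax.nn.sigmoid(i) * jnp.tanh(u)
    h = jax.nn.sigmoid(o) * jnp.tanh(c)
    g_k = phi["W_out"] @ h            # learned update direction g_k
    return g_k, (h, c)


def inner_step(z, state, phi, J):
    """One inner iteration: z^{(k+1)} = z^{(k)} + g_k."""
    grad = jax.grad(J)(z)             # ∇_z J at the current state
    g_k, state = gradient_operator(grad, state, phi)
    return z + g_k, state
```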

The parameters $\boldsymbol{\phi}$ of the gradient operator are themselves learned by minimizing a meta-loss

$$
\boldsymbol{L}_g\left(\boldsymbol{\phi}\right)
$$
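
One common choice for this meta-loss, assumed here since it is not spelled out above, is to unroll $K$ learned updates and accumulate the inner objective $\boldsymbol{J}$ along the trajectory (in the spirit of LSTM meta-optimizers). The sketch below continues the code above; `meta_loss` and the quadratic test objective are illustrative.

```python
# Sketch of an (assumed) unrolled meta-loss L_g(φ): sum of J(z^{(k)}) over K steps.
def meta_loss(phi, z0, J, K=10, hidden=20):
    state = (jnp.zeros(hidden), jnp.zeros(hidden))   # initial h_0, c_0
    z, total = z0, 0.0
    for _ in range(K):                               # unroll K inner iterations
        z, state = inner_step(z, state, phi, J)
        total = total + J(z)                         # accumulate J(z^{(k)})
    return total


# Example: learn to optimize a simple quadratic objective (illustrative only).
J = lambda z: jnp.sum(z ** 2)
phi = init_phi(jax.random.PRNGKey(0), dim=5)
z0 = jnp.ones(5)
meta_grads = jax.grad(meta_loss)(phi, z0, J)         # ∇_φ L_g(φ)
```

Differentiating `meta_loss` with respect to $\boldsymbol{\phi}$ backpropagates through the whole unrolled trajectory, which is how the update rule itself gets trained rather than hand-designed.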