Bayesian Learning Rule

A generalization of many optimization schemes


We assume that there exists a subclass of distributions, $\mathcal{Q}$, over which we can optimize. In this case, we will assume that $\mathcal{Q}$ is a regular and minimal exponential family of the form

$$q(\boldsymbol{z};\boldsymbol{\lambda}) = h(\boldsymbol{z}) \exp \left[ \left\langle \boldsymbol{\lambda}, \boldsymbol{T}(\boldsymbol{z})\right\rangle - A(\boldsymbol{\lambda}) \right]$$
  • $\boldsymbol{\lambda}\in\Omega\subset\mathbb{R}^M$ are the natural (or canonical) parameters.
  • $A(\boldsymbol{\lambda})$ is the log partition (or cumulant) function, which is finite, strictly convex, and differentiable over $\Omega$.
  • $\boldsymbol{T}(\boldsymbol{z})$ is the sufficient statistic, and $\langle \cdot,\cdot\rangle$ is an inner product.
  • $h(\boldsymbol{z})$ is the base measure (see the numerical sketch after this list).
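
To make the notation concrete, here is a minimal sketch (in Python, with hypothetical helper names) of the univariate Gaussian $\mathcal{N}(m, s^2)$ written in this exponential-family form, with $h(z)=1$, $\boldsymbol{T}(z)=(z, z^2)$, and $\boldsymbol{\lambda}=(m/s^2,\,-1/(2s^2))$. It checks that $\langle\boldsymbol{\lambda},\boldsymbol{T}(z)\rangle - A(\boldsymbol{\lambda})$ reproduces the usual log-density.

```python
import numpy as np
from scipy.stats import norm

def natural_params(m, s2):
    """Natural parameters lambda = (m / s2, -1 / (2 s2))."""
    return np.array([m / s2, -1.0 / (2.0 * s2)])

def sufficient_stats(z):
    """Sufficient statistics T(z) = (z, z^2)."""
    return np.array([z, z**2])

def log_partition(lam):
    """Log partition A(lambda) for the 1D Gaussian, with base measure h(z) = 1."""
    lam1, lam2 = lam
    return -lam1**2 / (4.0 * lam2) + 0.5 * np.log(np.pi / (-lam2))

def log_q(z, lam):
    """log q(z; lambda) = <lambda, T(z)> - A(lambda), since log h(z) = 0."""
    return lam @ sufficient_stats(z) - log_partition(lam)

m, s2, z = 1.5, 0.7, 0.3
lam = natural_params(m, s2)
print(log_q(z, lam))                              # exponential-family form
print(norm.logpdf(z, loc=m, scale=np.sqrt(s2)))   # agrees with the usual density

```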
Example: Multivariate Gaussian Distribution

A familiar member of the exponential family is the multivariate Gaussian distribution. It is most often written in its moment form, with mean $\mathbf{m}$ and covariance $\mathbf{P}$:

$$\mathcal{N}(\boldsymbol{z}\mid\mathbf{m},\mathbf{P}) \propto \exp\left[ -\frac{1}{2}(\boldsymbol{z} - \mathbf{m})^\top \mathbf{P}^{-1}(\boldsymbol{z} - \mathbf{m}) \right]$$

We can also write this in information form, which exposes the natural parameters:

$$\mathcal{N}(\boldsymbol{z}\mid\mathbf{m},\mathbf{P}) \propto \exp\left[ (\mathbf{P}^{-1}\mathbf{m})^\top \boldsymbol{z} + \text{Trace}\left(-\frac{\mathbf{P}^{-1}}{2} \boldsymbol{z}\boldsymbol{z}^\top\right) \right]$$

We can summarize the moment, natural, and expectation parameterizations as follows [Hamelijnck et al., 2021]:

$$\begin{aligned}
\text{Moment Parameters}: && \boldsymbol{\theta} &= (\mathbf{m},\mathbf{P}) \\
\text{Natural Parameters}: && \boldsymbol{\lambda} &= \left(\mathbf{P}^{-1}\mathbf{m},\,-\tfrac{1}{2}\mathbf{P}^{-1}\right) \\
\text{Expectation Parameters}: && \boldsymbol{\mu} &= (\mathbf{m},\,\mathbf{m}\mathbf{m}^\top+\mathbf{P})
\end{aligned}$$
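
As a small sketch (hypothetical helper names, assuming $\mathbf{P}$ is the covariance matrix as in the table above), the conversions between these parameterizations can be written as:

```python
import numpy as np

def moment_to_natural(m, P):
    """(m, P) -> (P^{-1} m, -1/2 P^{-1})."""
    P_inv = np.linalg.inv(P)
    return P_inv @ m, -0.5 * P_inv

def natural_to_moment(lam1, lam2):
    """(lam1, lam2) -> (m, P), inverting lam2 = -1/2 P^{-1}."""
    P = np.linalg.inv(-2.0 * lam2)
    return P @ lam1, P

def moment_to_expectation(m, P):
    """(m, P) -> (m, m m^T + P)."""
    return m, np.outer(m, m) + P

m = np.array([1.0, -2.0])
P = np.array([[2.0, 0.3], [0.3, 1.0]])
lam1, lam2 = moment_to_natural(m, P)
m_back, P_back = natural_to_moment(lam1, lam2)
assert np.allclose(m_back, m) and np.allclose(P_back, P)
```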

In general, the expectation parameters are defined as the expected sufficient statistics:

$$\boldsymbol{\mu}(\boldsymbol{\lambda}) = \mathbb{E}_{\boldsymbol{z}\sim q(\boldsymbol{z};\boldsymbol{\lambda})} \left[ \boldsymbol{T}(\boldsymbol{z}) \right]$$

For a minimal exponential family, this is a bijective function of $\boldsymbol{\lambda}$. Some examples include the multivariate Gaussian distribution and the Bernoulli distribution.
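
As a quick Monte Carlo check (an illustrative sketch with made-up values), the Gaussian expectation parameters $\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{T}(\boldsymbol{z})]$ with $\boldsymbol{T}(\boldsymbol{z}) = (\boldsymbol{z}, \boldsymbol{z}\boldsymbol{z}^\top)$ indeed recover $(\mathbf{m}, \mathbf{m}\mathbf{m}^\top + \mathbf{P})$:

```python
import numpy as np

rng = np.random.default_rng(0)
m = np.array([1.0, -2.0])
P = np.array([[2.0, 0.3], [0.3, 1.0]])                # covariance

z = rng.multivariate_normal(m, P, size=200_000)
mu1 = z.mean(axis=0)                                  # E[z]
mu2 = (z[:, :, None] * z[:, None, :]).mean(axis=0)    # E[z z^T]

print(np.allclose(mu1, m, atol=1e-2))                 # matches m
print(np.allclose(mu2, np.outer(m, m) + P, atol=5e-2))  # matches m m^T + P
```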

The Bayesian learning rule (BLR) is an optimization algorithm that seeks the best candidate $q^*(\boldsymbol{z};\boldsymbol{\lambda})$ in $\mathcal{Q}$ by updating the current candidate $q(\boldsymbol{z};\boldsymbol{\lambda}^{(k)})$, with natural parameters $\boldsymbol{\lambda}^{(k)}$ at iteration $k$, using a sequence of learning rates $\rho_k>0$. The update is

$$\boldsymbol{\lambda}^{(k+1)} = \boldsymbol{\lambda}^{(k)} - \rho_k\,\tilde{\nabla}_{\boldsymbol{\lambda}} \left[ \mathbb{E}_{\boldsymbol{z}\sim q(\boldsymbol{z};\boldsymbol{\lambda})}\left(-\log p(\boldsymbol{z},\boldsymbol{y})\right) - \mathcal{H}(q(\boldsymbol{z};\boldsymbol{\lambda})) \right]\Bigg|_{\boldsymbol{\lambda}=\boldsymbol{\lambda}^{(k)}}$$

where $p(\boldsymbol{z},\boldsymbol{y})$ is the joint distribution of the latent variables and the data, $\mathcal{H}(q)$ is the entropy of $q$, and $\tilde{\nabla}_{\boldsymbol{\lambda}}$ is the natural gradient, which for an exponential family equals the ordinary gradient taken with respect to the expectation parameters, $\tilde{\nabla}_{\boldsymbol{\lambda}} = \nabla_{\boldsymbol{\mu}}$.
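
As a concrete sketch (not the general algorithm), for a full Gaussian candidate $q(\boldsymbol{z}) = \mathcal{N}(\mathbf{m}, \mathbf{S}^{-1})$ the natural-gradient update above can be expressed in terms of the expected gradient and Hessian of $-\log p(\boldsymbol{z},\boldsymbol{y})$. The toy example below uses made-up values and a quadratic target, so those expectations are exact; iterating the update recovers the exact posterior mean and precision.

```python
import numpy as np

# Toy target: -log p(z, y) = 0.5 (z - b)^T A (z - b) + const.
A = np.array([[3.0, 0.5], [0.5, 1.0]])   # Hessian of -log p(z, y)
b = np.array([1.0, -2.0])                # minimizer of -log p(z, y)

m = np.zeros(2)                          # initial mean of q
S = np.eye(2)                            # initial precision of q
rho = 0.5                                # learning rate

for _ in range(100):
    g = A @ (m - b)                      # E_q[gradient of -log p]  (exact here)
    H = A                                # E_q[Hessian of -log p]   (exact here)
    S = (1.0 - rho) * S + rho * H        # natural-gradient step on the precision
    m = m - rho * np.linalg.solve(S, g)  # natural-gradient step on the mean

print(np.allclose(m, b), np.allclose(S, A))   # converges to the exact posterior
```

For this quadratic target, a learning rate of $\rho_k = 1$ recovers the exact answer in a single step, while smaller $\rho_k$ corresponds to damped, Newton-like updates.
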
References
  1. Hamelijnck, O., Wilkinson, W. J., Loppi, N. A., Solin, A., & Damoulas, T. (2021). Spatio-Temporal Variational Gaussian Processes. arXiv. 10.48550/ARXIV.2111.01732