We assume that there exists a subclass of distributions, $\mathcal{Q}$, which we can optimize over.
In this case, we will assume that $\mathcal{Q}$ is a regular and minimal exponential family of the form
$$
q(\boldsymbol{z};\boldsymbol{\lambda}) =
\boldsymbol{h}(\boldsymbol{z})
\exp
\left[
\left\langle \boldsymbol{\lambda}, \boldsymbol{T}(\boldsymbol{z})\right\rangle
- \boldsymbol{A}(\boldsymbol{\lambda})
\right]
$$

where

- $\boldsymbol{\lambda}\in\Omega\subset\mathbb{R}^M$ are the natural (or canonical) parameters,
- $\boldsymbol{A}(\boldsymbol{\lambda})$ is the log partition (or cumulant) function, which is finite, strictly convex and differentiable over $\Omega$,
- $\boldsymbol{T}(\boldsymbol{z})$ are the sufficient statistics,
- $\langle \cdot,\cdot\rangle$ is an inner product,
- $\boldsymbol{h}(\boldsymbol{z})$ is the base measure.
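To make the notation concrete, here is a minimal sketch (my own illustration, not from the text) that evaluates a one-dimensional Gaussian written in this natural-parameter form and checks it against the usual log-density; the helper names `T`, `A`, and `log_q` are mine.

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

def T(z):
    # sufficient statistics T(z) = (z, z^2)
    return jnp.array([z, z**2])

def A(lam):
    # log-partition function of the 1-D Gaussian in natural form (with h(z) = 1)
    lam1, lam2 = lam
    return -lam1**2 / (4 * lam2) + 0.5 * jnp.log(jnp.pi / -lam2)

def log_q(z, lam):
    # log q(z; lam) = log h(z) + <lam, T(z)> - A(lam), with h(z) = 1
    return jnp.dot(lam, T(z)) - A(lam)

m, s2 = 1.5, 0.7                                    # mean and variance
lam = jnp.array([m / s2, -1.0 / (2 * s2)])          # natural parameters
print(log_q(0.3, lam))
print(norm.logpdf(0.3, loc=m, scale=jnp.sqrt(s2)))  # should agree
```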
Example: Multivariate Gaussian Distribution

An example of the exponential family is the multivariate Gaussian distribution.
Most of us know the form given as
$$
\mathcal{N}(\boldsymbol{z}|\mathbf{m},\mathbf{P}^{-1}) \propto
\exp\left[
-\frac{1}{2}(\boldsymbol{z} - \mathbf{m})^\top
\mathbf{P}(\boldsymbol{z} - \mathbf{m})
\right]
$$

where $\mathbf{m}$ is the mean and $\mathbf{P}$ is the precision (inverse covariance) matrix. We can also write this in the information form, in terms of the natural parameters:

$$
\mathcal{N}(\boldsymbol{z}|\mathbf{m},\mathbf{P}^{-1}) \propto
\exp\left[
(\mathbf{Pm})^\top \boldsymbol{z} +
\text{Trace}\left(-\frac{\mathbf{P}}{2} \boldsymbol{zz}^\top\right)
\right]
$$
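As a quick sanity check (my own sketch, not part of the source), the quadratic form in the standard density and the trace term in the information form agree up to a constant that does not depend on $\boldsymbol{z}$:

```python
import jax.numpy as jnp
from jax import random

z = random.normal(random.PRNGKey(0), (3,))        # an arbitrary test point
m = jnp.array([0.5, -1.0, 2.0])                   # mean
L = random.normal(random.PRNGKey(1), (3, 3))
P = L @ L.T + 3.0 * jnp.eye(3)                    # a positive-definite precision matrix

standard = -0.5 * (z - m) @ P @ (z - m)
information = (P @ m) @ z + jnp.trace(-0.5 * P @ jnp.outer(z, z))

# The two differ only by the z-independent constant 1/2 m^T P m,
# which is absorbed into the normalizer.
print(standard, information - 0.5 * m @ P @ m)    # should be equal
```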
We can write the corresponding moment, natural, and expectation parameters as [Hamelijnck et al., 2021]:
$$
\begin{aligned}
\text{Moment Parameters}: &&
\boldsymbol{\theta} &=
(\mathbf{m},\mathbf{P}^{-1}) \\
\text{Natural Parameters}: &&
\boldsymbol{\lambda} &=
(\mathbf{Pm},-\tfrac{1}{2}\mathbf{P}) \\
\text{Expectation Parameters}: &&
\boldsymbol{\mu} &=
(\mathbf{m},\mathbf{mm}^\top+\mathbf{P}^{-1})
\end{aligned}
$$
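The table translates directly into code. The sketch below (assumed helper names of my own, with $\mathbf{P}$ the precision matrix as above) converts a multivariate Gaussian between the three parameterizations:

```python
import jax.numpy as jnp

def moment_to_natural(m, cov):
    # lambda = (P m, -1/2 P), with P = cov^{-1} the precision matrix
    P = jnp.linalg.inv(cov)
    return P @ m, -0.5 * P

def natural_to_moment(lam1, lam2):
    # invert the map above: P = -2 lambda_2, m = P^{-1} lambda_1
    P = -2.0 * lam2
    cov = jnp.linalg.inv(P)
    return cov @ lam1, cov

def moment_to_expectation(m, cov):
    # mu = (E[z], E[z z^T]) = (m, m m^T + cov)
    return m, jnp.outer(m, m) + cov

m = jnp.array([0.5, -1.0])
cov = jnp.array([[0.6, 0.1], [0.1, 0.8]])
lam1, lam2 = moment_to_natural(m, cov)
print(natural_to_moment(lam1, lam2))   # recovers (m, cov)
print(moment_to_expectation(m, cov))
```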
The expectation parameters are given by the following formula:
$$
\boldsymbol{\mu}(\boldsymbol{\lambda}) =
\mathbb{E}_{\boldsymbol{z}\sim q(\boldsymbol{z};\boldsymbol{\lambda})}
\left[
\boldsymbol{T}(\boldsymbol{z})
\right]
$$

This is a bijective function of $\boldsymbol{\lambda}$.
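A standard exponential-family identity (not stated above, but useful for checking) is that the expectation parameters are the gradient of the log-partition function, $\boldsymbol{\mu}(\boldsymbol{\lambda}) = \nabla_{\boldsymbol{\lambda}} A(\boldsymbol{\lambda})$. A quick check for the 1-D Gaussian, reusing the `A` from the earlier sketch:

```python
import jax
import jax.numpy as jnp

def A(lam):
    # log-partition of the 1-D Gaussian in natural form (h(z) = 1)
    lam1, lam2 = lam
    return -lam1**2 / (4 * lam2) + 0.5 * jnp.log(jnp.pi / -lam2)

m, s2 = 1.5, 0.7
lam = jnp.array([m / s2, -1.0 / (2 * s2)])
print(jax.grad(A)(lam))               # expectation parameters (E[z], E[z^2])
print(jnp.array([m, m**2 + s2]))      # (m, m^2 + sigma^2): should match
```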
Examples of such families include the multivariate normal distribution and the Bernoulli distribution.
The Bayesian learning rule (BLR) optimization algorithm tries to locate the best candidate $q^*(\boldsymbol{z};\boldsymbol{\lambda})$ in $\mathcal{Q}$ by updating the candidate $q(\boldsymbol{z};\boldsymbol{\lambda}_k)$ with natural parameter $\boldsymbol{\lambda}_k$ at iteration $k$, using a sequence of learning rates $\rho_k>0$.
The update is given by
$$
\boldsymbol{\lambda}_{k+1} = \boldsymbol{\lambda}_k - \rho_k\,\tilde{\nabla}_{\boldsymbol{\lambda}}
\left[
\mathbb{E}_{\boldsymbol{z}\sim q(\boldsymbol{z};\boldsymbol{\lambda})}\left(-\log p(\boldsymbol{z},\boldsymbol{y})\right)
- \mathcal{H}(q(\boldsymbol{z};\boldsymbol{\lambda}))
\right]\Big|_{\boldsymbol{\lambda}=\boldsymbol{\lambda}_k}
$$

where $\tilde{\nabla}_{\boldsymbol{\lambda}}$ denotes the natural gradient with respect to $\boldsymbol{\lambda}$ and $\mathcal{H}(\cdot)$ is the entropy.
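As an illustration (a toy sketch of my own, not the source's implementation), consider fitting a one-dimensional Gaussian $q$ to a Gaussian target $-\log p(z, y) = \tfrac{1}{2}\tau(z - z_0)^2 + \text{const}$. The sketch uses the standard identity that, for a minimal exponential family, the natural gradient with respect to $\boldsymbol{\lambda}$ equals the ordinary gradient with respect to the expectation parameters $\boldsymbol{\mu}$, so each BLR step is a plain gradient step in $\boldsymbol{\mu}$ applied to $\boldsymbol{\lambda}$; the fixed point recovers $q = \mathcal{N}(z_0, 1/\tau)$ exactly. The names `tau`, `z0`, and `neg_elbo_in_mu` are mine.

```python
import jax
import jax.numpy as jnp

tau, z0 = 4.0, 2.0                      # target precision and mean

def neg_elbo_in_mu(mu):
    # E_q[-log p(z, y)] - H(q), written as a function of mu = (m, m^2 + s2)
    m = mu[0]
    s2 = mu[1] - mu[0] ** 2             # variance recovered from expectation params
    expected_nll = 0.5 * tau * ((m - z0) ** 2 + s2)
    entropy = 0.5 * jnp.log(2 * jnp.pi * jnp.e * s2)
    return expected_nll - entropy

lam = jnp.array([0.0, -0.5])            # start from q = N(0, 1)
rho = 0.2                               # learning rate
for _ in range(200):
    m = -lam[0] / (2 * lam[1])          # current moment parameters
    s2 = -1.0 / (2 * lam[1])
    mu = jnp.array([m, m ** 2 + s2])
    # BLR step: natural gradient w.r.t. lam = ordinary gradient w.r.t. mu
    lam = lam - rho * jax.grad(neg_elbo_in_mu)(mu)

print(-lam[0] / (2 * lam[1]), -1.0 / (2 * lam[1]))   # ~ (z0, 1/tau) = (2.0, 0.25)
```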