The first example problem we may want to solve is the state estimation problem. In this case, we have some prior on the state, some conditional distribution on the quantity of interest (QOI), and some observations. We can write the joint probability of everything as
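$$
p(y, u, z) = p(y \mid z)\, p(u \mid z)\, p(z)
$$

where $z$ is the state, $u$ is the QOI, and $y$ are the observations. (This factorization is one plausible reading of the setup; it assumes that $y$ and $u$ are conditionally independent given $z$, consistent with the independence note made later.)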
We are interested in finding the best state given the QOI and the observations. By Bayes' rule, the posterior probability of the state is given by
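$$
p(z \mid y, u) = \frac{p(y \mid z)\, p(u \mid z)\, p(z)}{p(y, u)}
$$

(one plausible form, following the factorization above; the denominator $p(y, u)$ is the normalizing factor referred to next.)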
This normalizing factor is constant with respect to the state, so it does not affect the optimization. We typically work with the Maximum A Posteriori (MAP) approximation, which utilizes the fact that we can simply use proportionalities for the posterior.
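$$
p(z \mid y, u) \;\propto\; p(y \mid z)\, p(u \mid z)\, p(z),
\qquad
\hat{z}_{\text{MAP}} = \arg\max_z \; p(y \mid z)\, p(u \mid z)\, p(z)
$$

(a restatement of the posterior above with the normalizer dropped.)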
Note: we chose MAP estimation because it is the simplest approach; it reduces inference to minimizing a cost function. However, we can utilize other inference schemes, e.g. the Laplace approximation, variational Bayes, or MCMC/HMC.
Now, we need to set distributions and functions for each of the terms on the RHS of the equation.
For example, we could choose a uniform prior or a Gaussian prior, and we can vary its complexity: a Gaussian prior could have a full covariance, a diagonal covariance, or a scalar covariance. Let's call this the prior term.
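A minimal sketch of what the three Gaussian options could look like as negative log-densities (up to constants); the function names and arguments are our own illustration, not part of the original formulation:

```python
import jax.numpy as jnp

def prior_scalar(z, mu, sigma):
    """Scalar covariance: Sigma = sigma^2 * I."""
    return 0.5 * jnp.sum((z - mu) ** 2) / sigma**2

def prior_diagonal(z, mu, sigma_diag):
    """Diagonal covariance: Sigma = diag(sigma_diag^2)."""
    return 0.5 * jnp.sum(((z - mu) / sigma_diag) ** 2)

def prior_full(z, mu, cov):
    """Full covariance: quadratic form with Sigma^{-1}."""
    r = z - mu
    return 0.5 * r @ jnp.linalg.solve(cov, r)
```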
We can define an objective function that measures the faithfulness of the learned transformation between the state and the observations. It is a loss function that encapsulates the data likelihood, i.e. the likelihood that the observations are generated by the function applied to the state. Let's call this the data fidelity term.
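For instance, assuming a Gaussian observation model with a (hypothetical) observation operator $H$ and noise covariance $\Sigma_y$, the data fidelity term could take the familiar weighted least-squares form

$$
D(z) = \tfrac{1}{2}\, \big(y - H(z)\big)^\top \Sigma_y^{-1} \big(y - H(z)\big),
\qquad
p(y \mid z) \propto \exp\big(-D(z)\big).
$$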
We can also define an objective function that measures the faithfulness of the learned transformation between the state and the prior information $u$. Let's call this the regularization term.
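Analogously, assuming a Gaussian model linking the state to the prior information $u$ through a (hypothetical) operator $T$ with covariance $\Sigma_u$, the regularization term could be

$$
R(z) = \tfrac{1}{2}\, \big(u - T(z)\big)^\top \Sigma_u^{-1} \big(u - T(z)\big),
\qquad
p(u \mid z) \propto \exp\big(-R(z)\big).
$$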
Finally, we can describe the full objective function, which encapsulates all of the terms described above: 1) the prior term, 2) the data fidelity term, and 3) the regularization term.
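Following the notation above, and writing the prior term as $P_z(z)$ with $p(z) \propto \exp(-P_z(z))$, one plausible form of the full objective is

$$
J(z) = D(z) + R(z) + P_z(z) = -\log p(y \mid z) - \log p(u \mid z) - \log p(z) + \text{const.}
$$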
This is also known as an energy function, as it is related to the Boltzmann or Gibbs distribution. Finally, we can define a minimization problem to find the best state, $z$, given the observations/data, $y$, and the prior, $u$.
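In this reading, the posterior is the Gibbs distribution $p(z \mid y, u) \propto \exp(-J(z))$, and the MAP state is its minimizer:

$$
\hat{z} = \arg\min_z \; D(z) + R(z) + P_z(z).
$$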
Note: we have assumed MAP estimation, which presumes that there is a unique solution to the optimization problem. However, we know that this is not necessarily the best choice. We could use other inference methods like MCMC/HMC sampling or Stochastic Gradient Langevin Dynamics (SGLD).
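As a concrete, purely illustrative sketch, here is how the MAP state could be found by gradient descent on the energy $J(z)$ in JAX. The operators `H` and `T` and the Gaussian noise scales are hypothetical choices, not part of the original formulation:

```python
import jax
import jax.numpy as jnp

def energy(z, y, u, H, T, sigma_y=0.1, sigma_u=0.5, sigma_z=1.0):
    """J(z) = D(z) + R(z) + P_z(z) under simple Gaussian assumptions."""
    data_fidelity = 0.5 * jnp.sum((y - H @ z) ** 2) / sigma_y**2   # D(z)
    regularization = 0.5 * jnp.sum((u - T @ z) ** 2) / sigma_u**2  # R(z)
    prior = 0.5 * jnp.sum(z**2) / sigma_z**2                       # P_z(z), zero-mean Gaussian
    return data_fidelity + regularization + prior

def map_estimate(y, u, H, T, n_steps=500, lr=1e-2):
    """Plain gradient descent on J(z), starting from zero."""
    grad_J = jax.grad(energy)
    step = lambda _, z: z - lr * grad_J(z, y, u, H, T)
    return jax.lax.fori_loop(0, n_steps, step, jnp.zeros(H.shape[1]))
```

In practice one would typically use a better optimizer (e.g. L-BFGS or Adam) and a problem-specific initialization, but the structure of the estimate stays the same.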
In the state estimation section, it was alluded to throughout that we have parameters in each of the assumptions, e.g. in the data fidelity and regularization terms. Given some example data, we would also like to estimate these parameters.
Note: we impose the assumption that the transformations between the state and the observations, and between the state and the prior, are independent of each other.
Similar to the state estimation section, we can define each of the quantities above with appropriate distributions and functions.
$$
\begin{aligned}
\text{Data Fidelity:} &\quad y \sim p(y \mid z; \theta), &\quad p(y \mid z; \theta) &\propto \exp\big(-D(z;\theta)\big) \\
\text{Regularization:} &\quad u \sim p(u \mid z; \theta), &\quad p(u \mid z; \theta) &\propto \exp\big(-R(z;\theta)\big) \\
\text{State Prior:} &\quad z \sim p(z \mid \theta), &\quad p(z \mid \theta) &\propto \exp\big(-P_z(z;\theta)\big) \\
\text{Params Prior:} &\quad \theta \sim p(\theta; \alpha), &\quad p(\theta; \alpha) &\propto \exp\big(-P_\theta(\theta;\alpha)\big)
\end{aligned}
$$
Finally, we can describe the full objective function, which encapsulates all of the terms described above: 1) the prior term, 2) the data fidelity term, 3) the regularization term, and 4) the prior term for the parameters. We can then minimize this objective function to find the best parameters for the likelihood and prior terms. This leads to MAP estimation, whereby we find the best parameters given the data.
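Assuming the example data consists of triples $\{(y_n, u_n, z_n)\}_{n=1}^N$ (an assumption on how the data above is structured), one plausible form of this parameter objective is

$$
\hat{\theta} = \arg\min_\theta \; \sum_{n=1}^N \Big[ D(z_n; \theta) + R(z_n; \theta) + P_z(z_n; \theta) \Big] + P_\theta(\theta; \alpha).
$$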
Note: we have an extra prior term, which is a prior on the parameters. We can always put a prior on the parameters, and this leads to more advanced inference regimes like MCMC/HMC and SGLD. However, in general, we normally place a uniform prior on these parameters and simply use the MLE procedure. See this tutorial for more information about the difference between MLE and MAP.
In the above cases, we were looking at parameter estimation and state estimation separately. However, we may wish to learn them jointly. This ultimately leads to a bi-level optimization scheme whereby we need to find the best parameters, $\theta$, given the solution of the state estimation problem. So we can define some objective function
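One plausible way to write this bi-level problem, using the notation above and writing $\hat{z}(\theta)$ for the solution of the inner state estimation problem, is

$$
\hat{\theta} = \arg\min_\theta \; \mathcal{L}\big(\hat{z}(\theta)\big)
\quad \text{s.t.} \quad
\hat{z}(\theta) = \arg\min_z \; D(z;\theta) + R(z;\theta) + P_z(z;\theta),
$$

where $\mathcal{L}$ is an outer loss, e.g. the mismatch between $\hat{z}(\theta)$ and a known state from a synthetic experiment (the setting described next).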
This can be a good strategy: design synthetic experiments that mimic the real experiments in order to learn appropriate priors, and then apply the learned prior directly to a state estimation problem.
Note: the difficulty in this scheme is calculating the gradient of the loss function with respect to parameters that enter through the solution of another minimization problem. Using the chain rule, we have
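$$
\frac{\partial \mathcal{L}}{\partial \theta}
= \frac{\partial \mathcal{L}}{\partial \hat{z}(\theta)}\,
  \frac{\partial \hat{z}(\theta)}{\partial \theta}
$$

(written here in terms of $\hat{z}(\theta)$, the solution of the inner problem, as above),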
which means we need to calculate the derivative $\partial \hat{z}(\theta) / \partial \theta$ of the solution of the inner minimization problem with respect to the parameters, a.k.a. argmin differentiation. There are methods to do this, such as unrolling or implicit differentiation.
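A minimal sketch of the unrolling approach in JAX (the toy inner objective and the outer loss below are hypothetical choices for illustration): the inner problem is solved with a fixed number of gradient-descent steps written out explicitly, so reverse-mode autodiff can propagate through them and deliver $\partial \mathcal{L} / \partial \theta$ without forming $\partial \hat{z} / \partial \theta$ by hand.

```python
import jax
import jax.numpy as jnp

def inner_energy(z, theta, y):
    """Hypothetical inner objective: data fit plus a theta-weighted penalty on z."""
    return 0.5 * jnp.sum((y - z) ** 2) + theta * jnp.sum(z**2)

def inner_solve_unrolled(theta, y, n_steps=50, lr=0.1):
    """Approximate z_hat(theta) with a fixed, unrolled number of gradient steps."""
    grad_z = jax.grad(inner_energy, argnums=0)
    z = jnp.zeros_like(y)
    for _ in range(n_steps):          # unrolled: autodiff traces every step
        z = z - lr * grad_z(z, theta, y)
    return z

def outer_loss(theta, y, z_true):
    """Outer loss comparing the inner solution against a known synthetic state."""
    z_hat = inner_solve_unrolled(theta, y)
    return 0.5 * jnp.sum((z_hat - z_true) ** 2)

# Gradient of the outer loss w.r.t. theta, differentiated through the unrolled solver.
dL_dtheta = jax.grad(outer_loss)(0.5, jnp.array([1.0, 2.0]), jnp.array([0.9, 1.8]))
```

Implicit differentiation instead applies the implicit function theorem to the first-order optimality condition of the inner problem, which avoids storing the unrolled iterations; libraries such as `jaxopt` provide this.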
In the above sections, we gave a fairly complete treatment of how we can estimate the state, incorporating a conditional prior and a conditional observation operator. However, this can be a rather involved procedure with a lot of moving pieces. There may be many cases where we don't have strong prior knowledge, nor do we have any interest in estimating a state. Perhaps we just have many/some samples available.
If we have enough data, then it might not be necessary to solve the inversion problem at all; instead, we can estimate our QOI directly without utilizing an intermediate state space.
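As a sketch of what estimating the QOI directly could look like (the linear model and helper names below are purely illustrative assumptions), we can regress $u$ on $y$ from paired samples and skip the state $z$ entirely:

```python
import jax.numpy as jnp

def fit_direct(Y, U):
    """Least-squares fit of a linear map from observations to the QOI.

    Y: (N, d_y) array of observations, U: (N, d_u) array of QOI samples.
    """
    Y1 = jnp.concatenate([Y, jnp.ones((Y.shape[0], 1))], axis=1)  # append bias column
    W, *_ = jnp.linalg.lstsq(Y1, U)                               # (d_y + 1, d_u)
    return W

def predict_direct(W, y):
    """Predict the QOI for a new observation y."""
    return jnp.concatenate([y, jnp.ones(1)]) @ W
```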