
Machine Learning Setting

Ultimately, the goal for most scientists is to understand everything there is to know about one component in nature. This can be something Earth-related like temperature or drought occurrences, something societal like market trends or divorce rates, or something psychological like happiness. In all of these scenarios, we have something that we want to understand. In a machine learning setting, this is almost equivalent to predicting. From a modeling perspective, once we are able to correctly predict something given a time, a place or other factors, we have effectively learned everything there is to know about that factor. Let's call this component \mathcal{Y}; we can choose any example we want. In order to effectively predict \mathcal{Y}, we need to be given some other factors that are related to \mathcal{Y}. We can call these other factors \mathcal{X}. The nature of machine learning is to use observations of the factors \mathcal{X} to predict the corresponding outcome \mathcal{Y}. We will outline two approaches to addressing a problem like this in an ML setting.


Machine Learning Model

The most complete way to address problems is to consider the following:

p(y|\mathbf{x}; \theta)

In this setting, we are trying to predict y given some inputs \mathbf{x}. This is known as conditional density estimation. In its pure form, this is a very hard problem because in order for this to be possible, we need to know the joint probability density of \mathbf{x} and y as well as the probability density of \mathbf{x}. This implies that we have effectively modeled \mathcal{Y} and \mathcal{X}, have a complete understanding of both components, and subsequently understand the component \mathcal{Y} given some instance of \mathcal{X}. So concretely, we need to know p(\mathbf{x},y) as well as p(\mathbf{x}), since p(y|\mathbf{x}) = p(\mathbf{x},y) / p(\mathbf{x}). But the problem of density estimation is very challenging, especially in multivariate and high-dimensional settings.
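
To make the relation p(y|\mathbf{x}) = p(\mathbf{x},y) / p(\mathbf{x}) concrete, here is a minimal sketch (not part of the original text) that estimates both densities with kernel density estimators on toy data. The sinusoidal data, the scipy-based estimators, and the grid sizes are illustrative assumptions, not a prescribed recipe.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Toy data: x and y are related through a noisy nonlinear function.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=2000)
y = np.sin(x) + 0.3 * rng.standard_normal(2000)

# Estimate the joint density p(x, y) and the marginal density p(x).
p_joint = gaussian_kde(np.vstack([x, y]))   # p(x, y)
p_x = gaussian_kde(x)                       # p(x)

def conditional_density(y_query, x_query):
    """Approximate p(y | x) = p(x, y) / p(x) on a grid of y values."""
    pts = np.vstack([np.full_like(y_query, x_query), y_query])
    return p_joint(pts) / p_x(x_query)

# Evaluate p(y | x = 1.0) over a grid of candidate y values.
y_grid = np.linspace(-2, 2, 200)
p_y_given_x = conditional_density(y_grid, 1.0)
print(y_grid[np.argmax(p_y_given_x)])  # mode should be near sin(1.0)
```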

As somewhat of a proxy for this setting, we can relax the problem to allow for point estimates or approximations by considering an alternative approach where we try to find a function that allows us to use \mathbf{x} to predict y. So instead we have

y = f(\mathbf{x})

where essentially we are trying to find a mapping from \mathcal{X} to the conditional expectation \mathbb{E}[\mathcal{Y}|\mathcal{X}]. In other words, we have a latent function f(\cdot) that uses \mathbf{x} \sim \mathcal{X} to learn something about y \sim \mathcal{Y}. This is the simplest scenario we can construct: we have effectively bypassed the density estimation and are only considering the expected output of \mathcal{Y} given \mathcal{X}. We won't know exactly what would happen for every single instance of \mathcal{X}, but we can at least get some approximation. The literature is vast and there are many methods, ranging from tree-based algorithms like decision trees to deep-learning-based algorithms like neural networks. There are many pros and cons to each and every algorithm, but they are all trying to do the same thing: estimate y given \mathbf{x}.
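
As a rough illustration of the latent-function view, the sketch below fits two common function approximators to toy data; the sinusoidal data, the scikit-learn models, and their hyperparameters are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# Toy data: the "true" latent function is sin(x), observed with noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(500)

# Two very different function approximators, same goal: estimate y from x.
tree = DecisionTreeRegressor(max_depth=5).fit(X, y)
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000).fit(X, y)

x_new = np.array([[1.0]])
print(tree.predict(x_new), net.predict(x_new))  # both should be near sin(1.0)
```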

At a very high level, we have outlined two approaches to addressing the same problem: conditional density estimation, the holy grail of all methods, and the latent function model, a crude approximation. The literature has shown that we can do very well at predicting with latent functions. It has also been shown that there are many additional assumptions and constraints that we can impose in order to achieve even better approximations to conditional density estimation. While direct density estimation is difficult, there are many modifications we can make to bring our function approximation even closer to density estimation. The next sections will explore some additions that allow us to achieve better estimations, and they are all related to the notion of uncertainty quantification.


Uncertainty Characterization

Accounting for or quantifying the uncertainty in our model allows us to achieve higher robustness and subsequently improve our prediction accuracy. Robustness is defined as how well a model can predict y accurately once the variables \mathbf{x} have been altered or some assumptions have been changed. If we perturb any input \mathbf{x} by some small amount \delta, we should still be able to yield correct predictions. So a robust model will yield accurate predictions from any \mathbf{x}\in \mathcal{X}. Conversely, if a model isn't robust, there are certain values of \mathcal{X} that will consistently yield predictions that are inaccurate.
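
One simple, hand-rolled way to probe this kind of robustness is to perturb each input by a small \delta and check how much the predictions move. The model, toy data, and perturbation size below are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data with two inputs and a noisy response.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(1000)

model = RandomForestRegressor(n_estimators=200).fit(X, y)

# Perturb each input by a small delta and measure how much predictions shift.
delta = 0.05
baseline = model.predict(X)
for j in range(X.shape[1]):
    X_pert = X.copy()
    X_pert[:, j] += delta
    shift = np.mean(np.abs(model.predict(X_pert) - baseline))
    print(f"feature {j}: mean prediction shift = {shift:.4f}")
```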

Ideally, we want to be able to accurately predict the entire space of \mathcal{Y} given any \mathcal{X}, but this can be difficult and sometimes even impossible. The next best thing would be to yield very accurate predictions given some subset of the input space \mathcal{X}. Uncertainty, then, is knowing which subsets and regions of the input space \mathcal{X}, and which specific stages of our ML pipeline, fail to produce good predictions.


More Data

A trivial example is the case where we have only one input \mathbf{x} and try to create a model that is able to predict y based on that single observation. If we then feed the model a different input, it is very likely to predict the outcome incorrectly. So an obvious route to robustness is to get more data (if you can). This is the single most important way to increase the robustness of your model: the more instances and variations of your inputs your model has seen, the more robust it will become. The great thing is that this will increase the robustness of any machine learning model, and it is largely responsible for the success of deep neural networks.
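
A minimal sketch of this effect, under an assumed toy dataset and an assumed scikit-learn model: the test error typically drops as the training set grows.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_data(n):
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(n)
    return X, y

X_test, y_test = make_data(1000)

# Test error typically drops as the model sees more (and more varied) data.
for n_train in [10, 100, 1000]:
    X_train, y_train = make_data(n_train)
    model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=5000).fit(X_train, y_train)
    err = mean_squared_error(y_test, model.predict(X_test))
    print(f"n_train={n_train:5d}  test MSE={err:.3f}")
```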


Sensitivity Analysis

Oftentimes, modeling uncertainty directly is difficult. An easier problem to solve is sensitivity analysis, defined as:

The study of how the uncertainty in the output of a model can be apportioned to the different sources of uncertainty in the model input.

  • Identify which inputs are most influential for the predictions
  • "Backward Process"
  • Examples: Morris, Sobol, Regression (a regression-based sketch follows below)
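
As a rough sketch of the regression-based flavour of sensitivity analysis mentioned above, we can fit a linear model to standardized inputs and outputs and rank the inputs by the magnitude of their coefficients. The toy data and coefficient values below are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model with three inputs of very different influence on the output.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * X[:, 2] + 0.5 * rng.standard_normal(5000)

# Standardized regression coefficients: fit a linear model on standardized
# inputs/outputs; the coefficient magnitudes rank the inputs by influence.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()
src = LinearRegression().fit(Xs, ys).coef_
for j, s in enumerate(src):
    print(f"input {j}: standardized coefficient = {s:.3f}")
```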

A closely related field is the notion of model interpretability. This stems from the idea that we aren't satisfied with functions that simply map inputs \mathbf{x} to outputs y without giving the user any explanation of why the model made that decision.


Uncertainty Analysis

A way to scrutinize uncertainties in the model parameters, the input data, the assumptions and model structures.

  • How uncertainties propagate through the model
  • Effect of the uncertainties on the predictions
  • Identify the best model or policy given the uncertainty
  • "Foward Analysis"

If we were to directly try to estimate the conditional probability p(y|\mathbf{x};\theta), then we would have

y = f(\mathbf{x}) + \epsilon, \qquad \mathbf{x}\sim \mathcal{P}, \qquad \epsilon\sim\mathcal{N}(0, \sigma^2)

where \mathbf{x} comes from some probability distribution \mathcal{P} and \epsilon is a noise term, e.g. zero-mean Gaussian noise with variance \sigma^2.
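
A simple way to see this forward analysis in action is Monte Carlo propagation: sample \mathbf{x} from its distribution, push the samples through an assumed latent function f, add noise, and inspect the spread of y. Everything below (the function, the distributions, the sample size) is an illustrative assumption.

```python
import numpy as np

# Forward uncertainty analysis: push samples of an uncertain input x ~ P
# through the model f, add observation noise, then inspect the spread of y.
rng = np.random.default_rng(0)

def f(x):
    return np.sin(x)  # hypothetical latent function

n = 10_000
x = rng.normal(loc=1.0, scale=0.2, size=n)      # uncertain input, x ~ P
eps = rng.normal(loc=0.0, scale=0.1, size=n)    # observation noise
y = f(x) + eps

print("E[y]   =", y.mean())
print("std[y] =", y.std())
print("95% interval:", np.percentile(y, [2.5, 97.5]))
```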


Language Of Uncertainty

Model                  Function              Output                Inputs
y = f(x)               None                  None                  None
y = f(x) + \epsilon    f \sim \mathcal{S}    y \sim \mathcal{Q}    \mathbf{x} \sim \mathcal{P}

Name                   Model
Aleatoric              \mathbf{x}, y \sim \mathcal{P}, \mathcal{Q}
Epistemic              f \sim \mathcal{S}
Distribution Shift     \mathbf{x} \sim \mathcal{X}

Data Representation

Modeling is essentially trying to transform \mathbf{x} such that we get a linear relationship between y and \mathbf{x}. Mathematics allows us to transform our problem into different spaces where it can be solved more easily.

Example

A very simple example is identifying wave patterns along the coast of the ocean. We know that we can describe a wave as a combination of signals, which can be obtained using a Fourier decomposition. So by transforming our inputs into the Fourier domain, it is easier to identify which signals are the most representative of our region. Consequently, this allows us to make accurate predictions, albeit within the Fourier domain. So representing our data as sines and cosines made our problem much easier to solve.
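
A minimal sketch of this idea with numpy's FFT, using an assumed toy signal made of two waves plus noise; the sampling rate and frequencies are arbitrary choices for illustration.

```python
import numpy as np

# A signal built from two waves plus noise; the Fourier representation
# makes the dominant frequencies easy to read off.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 1000, endpoint=False)   # 10 s sampled at 100 Hz
signal = np.sin(2 * np.pi * 2.0 * t) + 0.5 * np.sin(2 * np.pi * 7.0 * t)
signal += 0.2 * rng.standard_normal(t.size)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])

# The two largest peaks should sit near 2 Hz and 7 Hz.
top = freqs[np.argsort(spectrum)[-2:]]
print("dominant frequencies (Hz):", np.sort(top))
```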

Data representation is very important and can transform our ML problem into a much easier one, once we have the correct representation.


Linear Methods

The simplest representation would be a linear transformation using a weight vector \mathbf{w} \in \mathbb{R}^{D} and a bias term b \in \mathbb{R}.

y = \mathbf{w}^\top\mathbf{x} + b

It is a simple, linear transformation. It is invertible, it is interpretable, and we can easily do sensitivity analysis. In addition, because it is so simple, we can build more complex predictions on top of it, for example quantile regression, which allows one to predict the upper and lower bounds of a predictive interval. We can even deal with uncertain inputs by means of errors-in-variables regression such as total least squares, BCES or orthogonal distance regression (ODR). Moreover, even if we don't use linear regression directly for modeling, we often use it as a measure of how well another, nonlinear model has performed. Almost all regression statistics are some form of linear regression, e.g. Pearson correlation corresponds to simple linear regression, whereas Spearman's \rho and Kendall's \tau are forms of rank-based regression. So even if your data is very non-linear, you will end up using simple linear regression to check that your potentially complex, non-linear model has accurate predictions.

TODO

Example with a linear model fitting a linear relationship and a complex relationship. Joint Plot
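
Until the joint plot called for above exists, here is a rough numerical stand-in: fit an (assumed) scikit-learn linear model to a toy linear relationship and a toy nonlinear one and compare the R^2 scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))

# A linear relationship and a more complex (nonlinear) one.
y_linear = 2.0 * X[:, 0] + 1.0 + 0.3 * rng.standard_normal(500)
y_complex = np.sin(3.0 * X[:, 0]) + 0.3 * rng.standard_normal(500)

# The linear model should score well on the first and poorly on the second.
for name, y in [("linear", y_linear), ("complex", y_complex)]:
    model = LinearRegression().fit(X, y)
    print(f"{name:8s} relationship: R^2 = {r2_score(y, model.predict(X)):.2f}")
```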


Neural Networks

To get more expressive functions, we can compose a series of transformations such that each composition provides a richer (deeper) feature representation of the data:

y = f_1\circ\;f_2\;\circ\;...\;\circ\;f_L( \mathbf{x})

When we compose these functions, we enable more complex representations of our dataset.

TODO

Example with a NN model fitting a linear relationship and a complex relationship. Joint Plot
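
As a rough stand-in for the plot above, the sketch below fits an assumed multilayer perceptron (scikit-learn's MLPRegressor with arbitrary settings) to the same two toy relationships; unlike the linear model, it should handle both.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y_linear = 2.0 * X[:, 0] + 1.0 + 0.3 * rng.standard_normal(500)
y_complex = np.sin(3.0 * X[:, 0]) + 0.3 * rng.standard_normal(500)

# A small neural network should fit both relationships reasonably well.
for name, y in [("linear", y_linear), ("complex", y_complex)]:
    net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000).fit(X, y)
    print(f"{name:8s} relationship: R^2 = {r2_score(y, net.predict(X)):.2f}")
```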


Kernel methods

Kernel methods apply a high-dimensional (often implicit) transformation to the inputs via a kernel function.

TODO

Example with a kernel model fitting a linear relationship and a complex relationship. Joint Plot
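
As a placeholder sketch, kernel ridge regression with an RBF kernel (an assumed choice; any kernel method would do) fit to the same two toy relationships.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y_linear = 2.0 * X[:, 0] + 1.0 + 0.3 * rng.standard_normal(500)
y_complex = np.sin(3.0 * X[:, 0]) + 0.3 * rng.standard_normal(500)

# The RBF kernel implicitly maps x into a high-dimensional feature space,
# where a "linear" fit corresponds to a flexible nonlinear fit in x.
for name, y in [("linear", y_linear), ("complex", y_complex)]:
    model = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0).fit(X, y)
    print(f"{name:8s} relationship: R^2 = {r2_score(y, model.predict(X)):.2f}")
```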


PDF Estimation

Probability density estimates

TODO

Example with a PDF estimator fitting a linear relationship and a complex relationship. Joint Plot
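
As a placeholder sketch, one way a PDF estimator can "fit" a relationship is to estimate the joint density p(x, y) with a KDE and then predict through the conditional mean E[y | x]. The toy data, the scipy KDE, and the grid are illustrative assumptions; the KDE smoothing introduces a small bias.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=2000)
datasets = {
    "linear":  2.0 * x + 1.0 + 0.3 * rng.standard_normal(2000),
    "complex": np.sin(x) + 0.3 * rng.standard_normal(2000),
}

# Fit the joint density p(x, y) with a KDE and predict through the
# conditional mean E[y | x], computed numerically over a grid of y values.
for name, y in datasets.items():
    p_joint = gaussian_kde(np.vstack([x, y]))
    y_grid = np.linspace(y.min(), y.max(), 400)

    def predict(x_query, p=p_joint, grid=y_grid):
        density = p(np.vstack([np.full_like(grid, x_query), grid]))
        return np.sum(grid * density) / np.sum(density)  # discrete E[y | x]

    # True values at x = 1.5: 4.0 (linear) and sin(1.5), roughly 1.0 (complex).
    print(f"{name:8s}: prediction at x=1.5 -> {predict(1.5):.2f}")
```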

Because we are in probability space, we have access to other attributes of our data.

  • Q: How much information about X does y contain?


TODO

EXAMPLES

ML Model

  1. Conditional Est Example (Data, Graphical Model) - Example | Example
  2. Direct Latent Function Example (Data, Graphical Model)

Uncertainty Analysis

  1. More Data - look at the residuals
  2. Sensitivity Analysis - Sobol, GP
  3. Uncertainty - GP (model, input, output)

Citations

  • arXiv - Conditional Density Estimation with Neural Networks: Best Practices and Benchmarks - Rothfuss et al. (2019)