My Thesis
Introduction
1.1 Earth Science
Outline of the problems in Earth science. I will focus specifically on multivariate, high-dimensional Earth science data. I would like to use the DataCubes as inspiration.
1.2 ML Approach
Outline my approach to the whole thing. I am focused on interactions between the data variables described above. More specifically, I am focused on the ML models themselves and how they can be used. The goal is conditional density estimation, but point estimates with non-linear functions are a good approximation.
The approaches will consist of similarity measures applied directly, e.g. the ρV-coefficient, kernel measures and the variation of information (information theory). That gives us a linear measure, a non-linear measure and an information-theoretic one.
Alternatively, we could use non-linear models, restricted to ones that account for uncertainty via Gaussian approximations: 1) complete change-of-variables methods, 2) approximate Gaussian models with input uncertainty, 3) approximate Gaussian models with sensitivity analysis (backwards uncertainty). Keep in mind the ultimate goal: comparing two or more variables.
1.3 Outline
Chapter 2 - Model Approximations
This chapter covers my work with models that account for uncertainty either directly (forward uncertainty) or that allow us to use sensitivity analysis (backwards uncertainty).
1. Modeling with Uncertainty
What is uncertainty? The language of uncertainty (the Bayesian view)? How do we model it? Use regression and walk through the cases. GPs in a nutshell as my choice. Explain in a bit more depth what a GP can and cannot do.
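As a concrete anchor, a minimal GP regression sketch (assuming scikit-learn and toy 1-D data of my own choosing, not anything from the thesis itself): the predictive distribution gives both a point estimate and an uncertainty at every test input.

```python
# Minimal GP regression sketch (assumption: scikit-learn; toy 1-D data for illustration).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(30, 1))           # training inputs
y = np.sin(X).ravel() + 0.1 * rng.randn(30)    # noisy observations

# RBF kernel for the latent function + WhiteKernel for the observation noise.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

X_test = np.linspace(-4, 4, 200).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)  # predictive mean and uncertainty
```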
2. Sensitivity
What is sensitivity? How is it related to interpretability? How does it approximate modeled uncertainty? A way to do uncertainty estimation (permutation plots, Sobol' indices, Morris, SHAP). The derivative of kernel methods gives us a way to approximate the sensitivity of the model itself.
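To make the last point concrete, a small numpy sketch (the kernel ridge fit, the squared-gradient summary and the toy data are illustrative assumptions, not the exact SAKAME formulation): for an RBF expansion f(x) = Σᵢ αᵢ k(x, xᵢ) the gradient has a closed form, and averaging its square over the data gives a per-feature sensitivity score.

```python
# Sketch: closed-form gradient of an RBF kernel expansion f(x) = sum_i alpha_i k(x, x_i).
# Illustrative assumptions: kernel ridge regression fit, squared-gradient sensitivity summary.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.randn(200)

length_scale = 1.0
gamma = 1.0 / (2 * length_scale ** 2)
krr = KernelRidge(kernel="rbf", gamma=gamma, alpha=1e-2).fit(X, y)
alpha = krr.dual_coef_                                   # expansion coefficients alpha_i

def rbf_gradient(x_star):
    """Gradient of f at x_star: sum_i alpha_i * k(x_star, x_i) * (x_i - x_star) / length_scale**2."""
    diff = X - x_star                                    # (n, d)
    k = np.exp(-np.sum(diff ** 2, axis=1) * gamma)       # k(x_star, x_i)
    return (alpha * k) @ diff / length_scale ** 2        # (d,)

# Per-feature sensitivity: expected squared derivative over the training points.
sensitivity = np.mean([rbf_gradient(x) ** 2 for x in X], axis=0)
```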
3. Applications
Basically the SAKAME work: sensitivity in an uncertain setting (GP) as well as extensions to other kernel methods (KRR, SVM, HSIC, KDE). Mention the Phi-Week application for emulation. Closing thoughts.
The EGP1.0 work as well, where we show how this can be applied to real data using the Taylor expansion.
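The first-order form I have in mind is a sketch of standard error propagation rather than the exact EGP1.0 expression: with a noisy test input x* ~ N(μₓ, Σₓ), the GP predictive variance is inflated by the local slope of the predictive mean,

$$
\tilde{\sigma}^2(\mathbf{x}_*) \;\approx\; \sigma^2_{\mathrm{GP}}(\boldsymbol{\mu}_x) \;+\; \nabla \mu(\boldsymbol{\mu}_x)^{\top} \, \boldsymbol{\Sigma}_x \, \nabla \mu(\boldsymbol{\mu}_x),
$$

where μ(·) and σ²_GP(·) are the usual GP predictive mean and variance and Σₓ is the input covariance.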
Chapter 3 - Data Representation
We take a more direct approach here. Instead of going straight for conditional density estimation (CDE), we look at different ways of representing the data as a way of estimating the CDE. I investigate different approaches, including a linear method, a non-linear method and a PDF estimator.
1. Similarity
What is it? How do we define it? How do we visualize it? How does it approximate the CDE?
2. Linear and Non-Linear
We start with the idea of the linear ρV-coefficient and show how it extends to kernel methods via a distance metric and a non-linear kernel function. We pay special attention to how one can choose the parameters in order to get the best representation in the unsupervised setting of multi-dimensional data.
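A small numpy sketch of the two measures side by side (the RBF kernel widths and the toy data are illustrative assumptions; choosing those widths, e.g. via a median heuristic, is exactly the parameter question this section addresses): the ρV-coefficient compares linear Gram matrices, and HSIC replaces them with kernel Gram matrices.

```python
# Sketch: linear rho-V coefficient vs. its kernel extension (a biased HSIC estimate).
# Kernel widths and toy data are illustrative assumptions.
import numpy as np

def rv_coefficient(X, Y):
    """rho-V coefficient between two data matrices (n x p) and (n x q)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sx, Sy = Xc @ Xc.T, Yc @ Yc.T                        # linear Gram matrices
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased HSIC estimate with RBF kernels: trace(K H L H) / (n - 1)^2."""
    n = X.shape[0]
    def rbf_gram(Z, sigma):
        sq = np.sum(Z ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T     # pairwise squared distances
        return np.exp(-d2 / (2 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n                  # centering matrix
    K, L = rbf_gram(X, sigma_x), rbf_gram(Y, sigma_y)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
Y = np.tanh(X[:, :3]) + 0.1 * rng.randn(100, 3)          # non-linear dependence
print(rv_coefficient(X, Y), hsic(X, Y))
```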
3. PDF Estimation
A different approach to direct modeling: estimating the density directly... using a model. Show the different methods already in use, including k-NN, exponential families and normalizing flows, and the method we choose: Gaussianization. Also talk about the metrics available in the form of information: Shannon information, entropy, mutual information and variation of information.
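A minimal sketch of one Gaussianization iteration in the RBIG spirit (the empirical-CDF marginal transform and the PCA rotation are illustrative choices, not the exact RBIG4EO implementation): repeating marginal Gaussianization and rotation drives the data towards a joint Gaussian, and the information removed per iteration is what opens the door to the information measures above.

```python
# Sketch of one Gaussianization (RBIG-style) iteration: marginal Gaussianization + rotation.
# Illustrative choices: empirical-CDF marginal transform, PCA rotation, toy gamma data.
import numpy as np
from scipy import stats

def marginal_gaussianization(X, eps=1e-6):
    """Map each column to approximately N(0, 1) via empirical CDF + inverse normal CDF."""
    n = X.shape[0]
    U = (np.argsort(np.argsort(X, axis=0), axis=0) + 1) / (n + 1)   # empirical CDF in (0, 1)
    return stats.norm.ppf(np.clip(U, eps, 1 - eps))

def rbig_iteration(X):
    """One iteration: marginal Gaussianization, then a PCA rotation."""
    G = marginal_gaussianization(X)
    _, _, Vt = np.linalg.svd(G - G.mean(axis=0), full_matrices=False)
    return G @ Vt.T                                                  # rotated, marginally Gaussian data

rng = np.random.RandomState(0)
X = rng.gamma(shape=2.0, size=(500, 4))       # non-Gaussian toy data
for _ in range(10):                           # repeat until the data look jointly Gaussian
    X = rbig_iteration(X)
```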
4. Applications
RBIG4EO: show the applications of using RBIG on spatio-temporal data.