Variables ¶

Quantity of Interest ¶

Our quantity of interest is temperature.

$$
\text{Quantity of Interest} =
\left[
\text{Temperature}
\right]
$$

So we can write this as:
$$
\begin{aligned}
\text{Quantity of Interest}: && &&
\boldsymbol{u}&=\boldsymbol{u}(\mathbf{s},t) && &&
\boldsymbol{u}: \mathbb{R}^{2}\times\mathbb{R}^+\rightarrow\mathbb{R}
\end{aligned}
$$

Covariates ¶

We choose some covariates (aka drivers) that we believe influence the quantity of interest.
$$
\text{Covariates} =
\begin{bmatrix}
\text{Sea Surface Temperature} \\
\text{Soil Moisture} \\
\text{Geopotential @ 500}
\end{bmatrix}^\top
$$

So we can write this as
$$
\begin{aligned}
\text{Covariates}: && &&
\boldsymbol{x}&=\boldsymbol{x}(\mathbf{s},t) && &&
\boldsymbol{x}:\mathbb{R}^{2}\times\mathbb{R}^+ \rightarrow\mathbb{R}^3
\end{aligned}
$$

Algorithm ¶

1. Load Data
2. Subset - Variables, Region, Period
3. Remove Climatology - Time
4. Resample - Time, Space
5. Aggregation - Space
6. Aggregation - Time
7. Convert to Samples

PseudoCode ¶

```python
# load datacube
# Dims: "Model Variable Ensemble Time Latitude Longitude"
u: xr.Dataset = xr.open_mfdataset(list_of_files)

# subset variables, region, period
variables: List[str] = ["t2m", "sst", "z500", "tmax"]
period: Period = Period(time=slice(t0, t1))
region: Region = init_region("spain")
u: xr.Dataset = select(u, variables, region, period)

# remove climatology
u: xr.Dataset = remove_climatology(u, **params)

# resample time, space
u: xr.Dataset = resample_time(u, **params)
u: xr.Dataset = resample_space(u, **params)

# split period
period0 = select_period("present")
period1 = select_period("future")
u0 = select(u, period0)
u1 = select(u, period1)

# spatial averaging
# Dims: "Model Variable Ensemble Time"
u0 = spatial_average(u0, **params)
u1 = spatial_average(u1, **params)

# temporal average
u0 = temporal_average(u0, **params)
u1 = temporal_average(u1, **params)

# convert to samples
new_dims: str = "(Model Ensemble)=Samples"
# Dims: "Samples Variable Time"
u0 = rearrange(u0, new_dims)
u1 = rearrange(u1, new_dims)
```
Model ¶

Some key characteristics of the data that we're dealing with stem from the preprocessing steps.
In the end, we are left with $N$ samples, where $N$ is the number of models times the number of ensemble members.
This leaves us with 30+ points to work with.

Small Data.
We have fewer than 50 samples.
This limits how much representational power we can afford, because we do not want to overfit.
Interpretable.
We are using a risk-based approach to climate attribution.
So, when we make predictions, we need to be able to interpret both the results and the model.
It would be advantageous if the model matched some underlying scientific understanding, though it could also surface new insights.
Uncertainty.
We need to quantify the uncertainty in our predictions and model parameters.
This is especially necessary for small datasets.
Previous Work ¶

This is based on the paper by [Zappa & Shepherd, 2017].
I outline the methodology below.
It assumes the data has a field representation:
$$
\begin{aligned}
\mathcal{D} &= \{ \mathbf{u}_m, \mathbf{x}_m\}_{m=1}^M, && &&
\mathbf{u}_m\in\mathbb{R}^{D_\Omega} &&
\mathbf{x}_m\in\mathbb{R}^{D_x}
\end{aligned}
$$

They also assume that every dimension of the QoI has its own set of parameters.
$$
p(u,\mathbf{x},\boldsymbol{\theta}) =
\prod_{m=1}^M\prod_{n=1}^{N_s} p(u_{mn}|\mathbf{x}_m,\boldsymbol{\theta}_n)
p(\boldsymbol{\theta}_n)
$$

They assume a functional form
$$
\begin{aligned}
\mathbf{u}_m &= \boldsymbol{f}(\mathbf{x}_m,\boldsymbol{\theta}) + \varepsilon_m,
&& && \varepsilon_m \sim \mathcal{N}(0,\sigma^2)
\end{aligned}
$$

So, the authors are interested in finding a parametric function that maps the covariates to the spatial field.
However, this is a difficult problem because we are mapping a 3-dimensional covariate vector to a $D_\Omega$-dimensional spatial field.
In other words, we have to map a low-dimensional vector to a high-dimensional spatial field.
In addition, we don't have many samples.
To combat this, the authors use a simple linear model.
$$
\boldsymbol{f}: \mathbb{R}^{D_x}\times\mathbb{R}^{D_\theta}\rightarrow\mathbb{R}^{D_\Omega}
$$

The function they employ is a simple additive linear model

$$
\boldsymbol{f}(\mathbf{x},\boldsymbol{\theta})=\mathbf{w}\mathbf{x} + \mathbf{b}
$$

where $\mathbf{w}\in\mathbb{R}^{D_\Omega\times D_x}$ and $\mathbf{b}\in\mathbb{R}^{D_\Omega}$.
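To make this concrete, here is a minimal NumPy sketch of fitting such a linear model by least squares; all sizes and data here are synthetic and illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
M, D_x, D_omega = 30, 3, 50            # models, covariates, spatial points

# synthetic covariates and a ground-truth linear map
X = rng.normal(size=(M, D_x))                 # (M, D_x)
W_true = rng.normal(size=(D_omega, D_x))      # (D_omega, D_x)
b_true = rng.normal(size=(D_omega,))          # (D_omega,)
U = X @ W_true.T + b_true + 0.01 * rng.normal(size=(M, D_omega))

# least-squares fit of f(x) = W x + b, one regression per spatial point,
# done jointly by appending a bias column to the design matrix
X1 = np.hstack([X, np.ones((M, 1))])          # (M, D_x + 1)
coef, *_ = np.linalg.lstsq(X1, U, rcond=None)  # (D_x + 1, D_omega)
W_hat, b_hat = coef[:-1].T, coef[-1]
```

With 30 samples and only 4 parameters per spatial point, the fit recovers the true map closely, which is the regime the paper operates in.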
Next, they perform a sensitivity analysis, whereby they query how the output responds to each input.
They can do this by taking the derivative of the function, $\boldsymbol{f}$, with respect to the input, $\mathbf{x}$.

$$
\boldsymbol{J}[\boldsymbol{f},\mathbf{x}] = \partial_\mathbf{x}\boldsymbol{f}(\mathbf{x},\boldsymbol{\theta})=\mathbf{w}
$$

Since this is a linear model, this can be calculated in closed form.
In addition, they query the function, $\boldsymbol{f}$, with a distribution of input parameters, $\mathbf{x}$:
$$
\begin{aligned}
\mathbf{x}_n &\sim p(\mathbf{x}) \\
\mathbf{u}_n &= \boldsymbol{f}(\mathbf{x}_n,\boldsymbol{\theta})
\end{aligned}
$$

Potential Improvements ¶

Uncertainty Quantification.
UQ is very important for decision-making.
In addition, there is very little data, so it is important to quantify both the parameter and predictive uncertainty.

Sensitivity Analysis.
They don't express the notion of sensitivity analysis explicitly.
While it is implicit in their formulation, I think there is more work that can be done to look at the SA literature and draw ideas from there.
In addition, a more general formulation would be beneficial to the community, especially because we can use modern tools like automatic differentiation, which work for many models.
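As a sketch of what a model-agnostic sensitivity query could look like: once the Jacobian is computed numerically rather than read off the linear form, the same analysis applies to any callable model. This uses finite differences as the simplest stand-in for autodiff; the nonlinear model below is purely illustrative:

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference Jacobian of f: R^Dx -> R^Du at the point x."""
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(f(x))
    J = np.zeros((f0.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (np.asarray(f(x + dx)) - f0) / eps
    return J

# illustrative nonlinear model u = tanh(W x)
W = np.array([[1.0, 0.5, -0.2],
              [0.0, 2.0, 1.0]])
f = lambda x: np.tanh(W @ x)

# at x = 0, tanh'(0) = 1, so the Jacobian reduces to W
J = jacobian_fd(f, np.zeros(3))
```

Swapping `jacobian_fd` for an autodiff call (e.g. a JAX `jacobian`) would give the same object exactly and cheaply for any differentiable model.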
Improved Modeling.
We could also improve the modeling itself.
I outline this in more detail below.
Candidate Models ¶

$$
\begin{aligned}
\text{Bayesian Model}: && &&
p(u,\boldsymbol{x},\boldsymbol{\theta}) &=
p(u|\boldsymbol{x},\boldsymbol{\theta})
p(\boldsymbol{\theta}) \\
\text{Hierarchical Bayesian Model}: && &&
p(u,\boldsymbol{x},\boldsymbol{z},\boldsymbol{\theta}) &=
p(u|\boldsymbol{z})
p(\boldsymbol{z}|\boldsymbol{x},\boldsymbol{\theta})
p(\boldsymbol{\theta}) \\
\text{Gaussian Process}: && &&
p(u,\boldsymbol{f},\mathbf{X},\boldsymbol{\theta}) &=
p(u|\boldsymbol{f})
p(\boldsymbol{f}|\mathbf{X},\boldsymbol{\theta})
p(\boldsymbol{\theta}) \\
\text{Neural Fields}: && &&
p(u,x,s,\theta) &= p(u|x,s,\theta)p(\theta)
\end{aligned}
$$

Bayesian Regression Model.
Our first choice is to use a simple Bayesian regression model.
For more information, see the prediction tutorial.

Hierarchical Bayesian Regression Model.
A second choice is to use a more advanced Bayesian regression model where we include latent variables.
This tries to account for unseen variables that are not within the data.
While it is less interpretable than the standard Bayesian regression model, it may be more accurate because it incorporates more sources of uncertainty.

Gaussian Process Regression Model.
The last choice is a non-parametric method.
It assumes an underlying function and then conditions on all of the observations.
This method has good predictive uncertainty with well-calibrated confidence intervals.
These methods are typically more challenging to apply at scale, but they work exceptionally well with small datasets.
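A minimal from-scratch sketch of GP regression illustrates why it suits the small-data regime: the posterior mean and variance are available in closed form from a kernel matrix. The RBF kernel, hyperparameters, and 1-D toy data here are assumptions for illustration only:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """RBF kernel between two 1-D sets of points."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise=0.1):
    """Closed-form GP posterior mean and marginal variance."""
    K = rbf(x_train, x_train) + noise**2 * np.eye(len(x_train))
    Ks = rbf(x_train, x_test)
    Kss = rbf(x_test, x_test)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks.T @ alpha
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, np.diag(cov)

# toy data: 20 noisy-free observations of sin(x)
x = np.linspace(-3, 3, 20)
y = np.sin(x)
mu, var = gp_posterior(x, y, np.array([0.0]))
```

With only 20 observations the predictive variance near the data is already small, while it grows back toward the prior away from the data, which is exactly the calibrated-uncertainty behaviour described above.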
Global Latent Variable Model ¶

Joint Distribution

$$
p(u,x,\theta) = p(u|x,\theta)p(\theta) = p(\theta)\prod_{m=1}^M p(u_m|x_m,\theta)
$$

Probabilistic Model
$$
\begin{aligned}
\text{Data Likelihood}: && &&
u_{m} &\sim p(u|\mathbf{x},\boldsymbol{\theta}) \\
\text{Prior}: && &&
\boldsymbol{\theta} &\sim p(\boldsymbol{\theta})
\end{aligned}
$$

Model
$$
\begin{aligned}
u_{m} = \boldsymbol{f}(\mathbf{x}_m,\boldsymbol{\theta}) + \varepsilon_m, && &&
\varepsilon_m\sim\mathcal{N}(0,\sigma^2)
\end{aligned}
$$

Operators
$$
\begin{aligned}
\text{Function}: && &&
\boldsymbol{f}&:\mathbb{R}^{D_x} \rightarrow \mathbb{R} \\
\text{Gradient}: && &&
\partial_x\boldsymbol{f}&:\mathbb{R}^{D_x} \rightarrow \mathbb{R}^{D_x}
\end{aligned}
$$

Local Latent Variable Model ¶

In this case, we do not do any pooling.
Joint Distribution

$$
p(u,x,z,\theta) = p(u|z)p(z|x,\theta)p(\theta) =
\prod_{m=1}^M \prod_{n=1}^{N_s}
p(u_{mn}|z_{mn})p(z_{mn}|\mathbf{x}_m,\boldsymbol{\theta}_n)p(\boldsymbol{\theta}_n)
$$

Probabilistic Model
$$
\begin{aligned}
\text{Data Likelihood}: && &&
u_{mn} &\sim p(u|z_{mn}) \\
\text{Process Likelihood}: && &&
z_{mn} &\sim p(z|\mathbf{x}_m,\boldsymbol{\theta}_n) \\
\text{Prior}: && &&
\boldsymbol{\theta}_n &\sim p(\boldsymbol{\theta})
\end{aligned}
$$

Model
$$
\begin{aligned}
u_{mn} &= z_{mn} + \varepsilon_{mn}, && &&
\varepsilon_{mn}\sim\mathcal{N}(0,\sigma^2_u) \\
z_{mn} &= \boldsymbol{f}(\mathbf{x}_m,\boldsymbol{\theta}_n) + \eta_{mn}, && &&
\eta_{mn}\sim\mathcal{N}(0,\sigma^2_z)
\end{aligned}
$$

Operators
$$
\begin{aligned}
\text{Function}: && &&
\boldsymbol{f}&:\mathbb{R}^{D_x} \rightarrow \mathbb{R} \\
\text{Gradient}: && &&
\partial_x\boldsymbol{f}&:\mathbb{R}^{D_x} \rightarrow \mathbb{R}^{D_x}
\end{aligned}
$$

Hierarchical Latent Variable Model ¶
In this case, we do partial pooling.

Joint Distribution

$$
p(u,x,z,\theta) = p(u|z)p(z|x,\theta)p(\theta) =
p(\theta)\prod_{m=1}^M \prod_{n=1}^{N_s}
p(u_{mn}|z_{mn})p(z_{mn}|\mathbf{x}_m,\boldsymbol{\theta})
$$

Probabilistic Model
$$
\begin{aligned}
\text{Data Likelihood}: && &&
u_{mn} &\sim p(u|z_{mn}) \\
\text{Process Likelihood}: && &&
z_{mn} &\sim p(z|\mathbf{x}_m,\boldsymbol{\theta}) \\
\text{Prior}: && &&
\boldsymbol{\theta} &\sim p(\boldsymbol{\theta})
\end{aligned}
$$

Model
$$
\begin{aligned}
u_{mn} &= z_{mn} + \varepsilon_{mn}, && &&
\varepsilon_{mn}\sim\mathcal{N}(0,\sigma^2_u) \\
z_{mn} &= \boldsymbol{f}(\mathbf{x}_m,\boldsymbol{\theta}) + \eta_{mn}, && &&
\eta_{mn}\sim\mathcal{N}(0,\sigma^2_z)
\end{aligned}
$$

Operators
$$
\begin{aligned}
\text{Function}: && &&
\boldsymbol{f}&:\mathbb{R}^{D_x} \rightarrow \mathbb{R} \\
\text{Gradient}: && &&
\partial_x\boldsymbol{f}&:\mathbb{R}^{D_x} \rightarrow \mathbb{R}^{D_x}
\end{aligned}
$$

PseudoCode - From Scratch ¶
```python
# get inputs
x: Array["M D"] = ...
s: Array["N"] = ...
u: Array["M N"] = ...

# get dimensions
num_models, num_dims = x.shape
num_spatial_points = s.shape[0]

# per-location parameters
sigma_z: Array["N"] = sample("noise_z", dist, num_samples=N)
weight_z: Array["N D"] = sample("weight", dist, num_samples=N)
bias_z: Array["N"] = sample("bias", dist, num_samples=N)

# latent process: z = w x + b + noise
mu_z: Array["M N"] = einsum("ND,MD->MN", weight_z, x) + bias_z
z: Array["M N"] = sample("z", dist.Normal(mu_z, sigma_z))

# observation model
sigma_u: Array["M N"] = sample("noise_u", dist, num_samples=(M, N))
u: Array["M N"] = sample("obs", dist.Normal(z, sigma_u), obs=u)
```
PseudoCode - Plate Notation ¶

```python
# get inputs
x: Array["M D"] = ...
s: Array["N"] = ...
u: Array["M N"] = ...

# create plates for the spatial domain and the models
spatial_plate = plate(num_spatial_points)
model_plate = plate(num_models)

with spatial_plate:
    sigma_z: Array[""] = sample("noise_z", dist)
    weight_z: Array["D"] = sample("weight", dist)
    bias_z: Array[""] = sample("bias", dist)
    with model_plate:
        # z = w x + b + noise
        mu_z: Array[""] = einsum("D,D->", x, weight_z) + bias_z
        z: Array[""] = sample("z", dist.Normal(mu_z, sigma_z))
        sigma_u: Array[""] = sample("noise_u", dist)
        u: Array[""] = sample("obs", dist.Normal(z, sigma_u), obs=u)
```
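One way to sanity-check the plate structure is to simulate forward from the generative model in plain NumPy. The priors, sizes, and noise scales below are illustrative choices, not part of the pseudocode above:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, D = 30, 100, 3      # models, spatial points, covariates

# covariates, one vector per model
x = rng.normal(size=(M, D))

# per-location parameters (the spatial plate)
weight_z = rng.normal(size=(N, D))
bias_z = rng.normal(size=(N,))
sigma_z = np.abs(rng.normal(size=(N,)))

# latent process per (model, location): z = w x + b + noise
mu_z = np.einsum("ND,MD->MN", weight_z, x) + bias_z
z = mu_z + sigma_z * rng.normal(size=(M, N))

# observation model
sigma_u = 0.1
u = z + sigma_u * rng.normal(size=(M, N))
```

This confirms the einsum and broadcasting produce the intended `(M, N)` arrays before wiring the model into a probabilistic programming library.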
Gaussian Process Model ¶

Joint Distribution

$$
\begin{aligned}
p(u,x,z,f,s,\theta) &= p(u|z)p(z|f,x,\theta)p(f|s,\alpha)p(\theta)p(\alpha)\\
&=
p(\theta)p(\alpha)\prod_{m=1}^M \prod_{n=1}^{N_s}
p(u_{mn}|z_{mn})p(z_{mn}|\mathbf{x}_m, \boldsymbol{f}_n, \boldsymbol{\theta})
p(\boldsymbol{f}_n|\mathbf{s}_n,\boldsymbol{\alpha})
\end{aligned}
$$

Probabilistic Model
$$
\begin{aligned}
\text{Data Likelihood}: && &&
u_{mn} &\sim p(u|z_{mn}) \\
\text{Process Likelihood}: && &&
z_{mn} &\sim p(z|\mathbf{x}_m,\boldsymbol{f}_n,\boldsymbol{\theta}) \\
\text{GP Prior}: && &&
\boldsymbol{f}_n &\sim p(\boldsymbol{f}|\mathbf{s}_n,\boldsymbol{\alpha}) \\
\text{Prior}: && &&
\boldsymbol{\theta},\boldsymbol{\alpha} &\sim p(\boldsymbol{\theta})p(\boldsymbol{\alpha})
\end{aligned}
$$

```python
# get the inputs
S: Array["N Ds"] = ...
X: Array["M Dx"] = ...
U: Array["M N"] = ...

# initialize parameters
theta_f: Array[""] = sample(dist)
sigma: Array[""] = sample(dist)
B: Array["N Dx"] = sample(dist)
K: Array["N N"] = kernel(S, theta_f)

with plate("models", M):
    # mean model: linear in the covariates
    Bx: Array["N"] = einsum("ND,D->N", B, x)
    # GP with linear mean and observation noise on the diagonal
    gp: Model = GaussianProcess(K, mean=Bx, diag=sigma)
    # sample from the GP model, conditioned on the observations
    u: Array["N"] = sample("obs", gp, obs=U)
```
Back to Basics ¶

Discrete Field Representation

We have the data in the form of a field.

$$
\begin{aligned}
\mathcal{D} &= \{ \mathbf{u}_m, \mathbf{x}_m \}_{m=1}^M && &&
\mathbf{u}_m\in\mathbb{R}^{D_\Omega} && &&
\mathbf{x}_m\in\mathbb{R}^{D_x}
\end{aligned}
$$

Now, our target problem is:
$$
\boldsymbol{f}: \mathbb{R}^{D_x}\rightarrow\mathbb{R}^{D_\Omega}
$$

Coordinate Representation

We have the data in the form of coordinates

$$
\begin{aligned}
\mathcal{D} &= \{ (\mathbf{s}_n,u_{mn}), \mathbf{x}_m \}_{m=1,n=1}^{M,N_s} && &&
\mathbf{s}_n\in\mathbb{R}^{D_s} &&
\mathbf{x}_m\in\mathbb{R}^{D_x} &&
u_{mn}\in\mathbb{R}
\end{aligned}
$$

Now, we have our target problem:
$$
\boldsymbol{f}: \mathbb{R}^{D_s}\times\mathbb{R}^{D_x}\times\mathbb{R}^{D_\theta}
\rightarrow\mathbb{R}^{D_u}
$$
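As a sketch of a coordinate-based (neural-field-style) model with exactly this signature, here is a tiny random-feature map from a coordinate and a covariate vector to a scalar. The architecture, feature map, and sizes are illustrative assumptions, not a proposed model:

```python
import numpy as np

rng = np.random.default_rng(0)
D_s, D_x, D_h = 2, 3, 64   # coordinate dim, covariate dim, hidden features

# random feature layer followed by a linear readout: f(s, x) -> u
W_in = rng.normal(size=(D_h, D_s + D_x))
b_in = rng.normal(size=(D_h,))
w_out = rng.normal(size=(D_h,)) / np.sqrt(D_h)

def f(s, x):
    """Map a coordinate s in R^{D_s} and covariates x in R^{D_x} to u in R."""
    h = np.tanh(W_in @ np.concatenate([s, x]) + b_in)
    return float(w_out @ h)

u = f(np.zeros(D_s), np.zeros(D_x))
```

The key point is the signature: the spatial coordinate enters as an input alongside the covariates, so a single set of parameters is shared across the whole field instead of one parameter set per grid point.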