
Spaces

| Notation | Description |
|:---|:---|
| $N \in \mathbb{N}$ | the number of samples (natural number) |
| $D \in \mathbb{N}$ | the number of features/covariates (natural number) |

Variables

| Notation | Description |
|:---|:---|
| $x, y \in \mathbb{R}$ | scalars (real numbers) |
| $\mathbf{x} \in \mathbb{R}^{D_\mathbf{x}}$ | a $D_\mathbf{x}$-dimensional column vector, usually the input |
| $\mathbf{y} \in \mathbb{R}^{D_\mathbf{y}}$ | a $D_\mathbf{y}$-dimensional column vector, usually the output |
| $x^j \in \mathbb{R}$ | the $j$-th feature of a vector, $\mathbf{x} \in \mathbb{R}^{D}$, i.e. $(x^j)_{1\leq j \leq D}$ |
| $x_i \in \mathbb{R}$ | the $i$-th sample of a vector, $\mathbf{x} \in \mathbb{R}^{N}$, i.e. $(x_i)_{1\leq i \leq N}$ |
| $\mathbf{X} \in \mathbb{R}^{N \times D}$ | a collection of $N$ input vectors, $\mathbf{X}=[\mathbf{x}_1, \ldots, \mathbf{x}_N]^\top$, where $\mathbf{x} \in \mathbb{R}^{D}$ |
| $\mathbf{Y} \in \mathbb{R}^{N \times P}$ | a collection of $N$ output vectors, $\mathbf{Y}=[\mathbf{y}_1, \ldots, \mathbf{y}_N]^\top$, where $\mathbf{y} \in \mathbb{R}^{P}$ |
| $\mathbf{x}^{j} \in \mathbb{R}^{N}$ | the $j$-th feature (column) of a collection of vectors, $\mathbf{X}$, i.e. $(\mathbf{x}^{j})_{1\leq j \leq D}$ |
| $\mathbf{x}_{i} \in \mathbb{R}^{D}$ | the $i$-th sample (row) of a collection of vectors, $\mathbf{X}$, i.e. $(\mathbf{x}_{i})_{1\leq i \leq N}$ |
| $x_{i}^j \in \mathbb{R}$ | the $i$-th sample and $j$-th feature of a collection of vectors, $\mathbf{X}$, i.e. $(x_{i}^{j})_{1\leq i \leq N,\,1\leq j \leq D}$ |

Functions

| Notation | Description |
|:---|:---|
| $f : \mathcal{X} \rightarrow \mathcal{Y}$ | a latent function that operates on a scalar and maps a space $\mathcal{X}$ to a space $\mathcal{Y}$ |
| $\boldsymbol{f} : \mathcal{X} \rightarrow \mathcal{Y}$ | a latent function that operates on a vector and maps a space $\mathcal{X}$ to a space $\mathcal{Y}$ |
| $\boldsymbol{f}(\;\cdot\;;\boldsymbol{\theta})$ | a latent function parameterized by $\boldsymbol{\theta}$ |
| $\boldsymbol{f}_{\boldsymbol{\theta}}(\cdot)$ | a latent function parameterized by $\boldsymbol{\theta}$ (succinct version) |
| $\boldsymbol{k}(\cdot, \cdot)$ | kernel or covariance function |

Below, we give some specific forms of these functions and how they translate to real situations.

Scalar Input - Scalar Output

$$f: \mathbb{R} \rightarrow \mathbb{R}$$

Vector Input - Scalar Output

$$\boldsymbol{f}: \mathbb{R}^D \rightarrow \mathbb{R}$$

Example: 1D Spatio-Temporal Scalar Field

$$y = \boldsymbol{f}(x_\phi, t)$$

Example: 2D Spatial Scalar Field

We have a 2-dimensional scalar field. The coordinates, $\mathbf{x} \in \mathbb{R}^{D_\phi}$, are 2D, e.g. (lat, lon) coordinates with $D_\phi = [\phi, \psi]$. Each of these coordinates is associated with a scalar value, $y \in \mathbb{R}$. So we have a function, $\boldsymbol{f}$, that maps each coordinate, $\mathbf{x}$, of the field to a scalar value, $y$, i.e. $\boldsymbol{f}: \mathbb{R}^{D_\phi} \rightarrow \mathbb{R}$. More explicitly, we can write this function as:

$$y = \boldsymbol{f}(\mathbf{x}_\phi)$$

If we stack a lot of samples together, $\mathcal{D} = \left\{ \mathbf{x}_n, y_n\right\}_{n=1}^N$, we get a matrix for the coordinates, $\mathbf{X}$, and a vector for the scalar values, $\mathbf{y}$. So we have $\mathcal{D} = \left\{ \mathbf{X}, \mathbf{y}\right\}$.

Note: For more consistent and aesthetically pleasing notation, we have $\mathbf{Y} = \mathbf{y}^\top$ so we can write the dataset as $\mathcal{D} = \left\{ \mathbf{X}, \mathbf{Y}\right\}$.
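
To make the stacking concrete, here is a minimal numpy sketch for the 2D spatial case (the grid, its resolution, and the toy field are all illustrative, not implied by the notation):

```python
import numpy as np

# Hypothetical 2D spatial scalar field sampled on a (lat, lon) grid.
lat = np.linspace(-90, 90, 10)
lon = np.linspace(-180, 180, 20)
LAT, LON = np.meshgrid(lat, lon, indexing="ij")

# Stack the N = 10 * 20 coordinate samples into X ∈ R^{N×D_φ}, with D_φ = 2.
X = np.stack([LAT.ravel(), LON.ravel()], axis=-1)   # shape (200, 2)

# A toy scalar value y ∈ R at every coordinate, stacked into y ∈ R^N.
y = np.sin(np.deg2rad(LAT)).ravel()                 # shape (200,)

dataset = {"X": X, "y": y}                          # D = {X, y}
```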


Example: 2D Spatio-Temporal Scalar Field

$$y = \boldsymbol{f}(\mathbf{x}_\phi, t)$$

Vector Input - Vector Output

$$\boldsymbol{f}: \mathbb{R}^D \rightarrow \mathbb{R}^P$$

Example: 2D Vector Field

We have a 2-dimensional vector field (similar to the above example). The coordinates, $\mathbf{x} \in \mathbb{R}^{D_\phi}$, are 2D, e.g. (lat, lon) coordinates with $D_\phi = [\phi, \psi]$. Each of these coordinates is associated with a vector value, $\mathbf{y} \in \mathbb{R}^{P}$. In this case, let the dimensions be the (u, v) fields, i.e. $P=[u,v]$. So we have a function, $\boldsymbol{f}$, that maps each coordinate, $\mathbf{x}$, of the field to a vector value, $\mathbf{y}$, i.e. $\boldsymbol{f}: \mathbb{R}^{D_\phi} \rightarrow \mathbb{R}^{P}$. More explicitly, we can write this function as:

$$\mathbf{y} = \boldsymbol{f}(\mathbf{x})$$

Again, if we stack a lot of samples together, $\mathcal{D} = \left\{ \mathbf{x}_n, \mathbf{y}_n\right\}_{n=1}^N$, we get a pair of matrices, $\mathcal{D} = \left\{ \mathbf{X}, \mathbf{Y}\right\}$.


Special Case: $D = P$

$$\boldsymbol{f}:\mathbb{R}^2 \rightarrow \mathbb{R}^2$$

where the function takes in a 2D vector, $(x,y)$, and outputs a vector, $(u, v)$. This is analogous to a pair of scalar fields for $u$ and $v$, which appears often in physics. So

$$\begin{aligned} f_1(x,y) &= u \\ f_2(x,y) &= v \end{aligned}$$

We have our functional form given by:

$$\mathbf{f}\left( \begin{bmatrix} x \\ y \end{bmatrix} \right) = \begin{bmatrix} f_1(x,y) \\ f_2(x,y) \end{bmatrix} = \begin{bmatrix} u \\ v \end{bmatrix}$$
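
As a toy instance of this functional form, here is a small sketch where $f_1$ and $f_2$ define a simple rotational field (the particular field is illustrative, not implied by the notation):

```python
import numpy as np

def f(xy):
    # f : R^2 -> R^2, a toy rotational field standing in for (f1, f2)
    x, y = xy
    u = -y   # f1(x, y)
    v = x    # f2(x, y)
    return np.array([u, v])

uv = f(np.array([1.0, 2.0]))   # -> array([-2., 1.])
```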

Common Terms

| Notation | Description |
|:---|:---|
| $\theta$ | a parameter |
| $\theta_\alpha$ | a hyperparameter |
| $\boldsymbol{\theta}$ | a collection of parameters, $\boldsymbol{\theta}=[\theta_1, \theta_2, \ldots, \theta_p]$ |
| $\boldsymbol{\theta}_\alpha$ | a collection of hyperparameters, $\boldsymbol{\theta}_\alpha=[\theta_{\alpha,1}, \theta_{\alpha,2}, \ldots, \theta_{\alpha,p}]$ |

Probability

| Notation | Description |
|:---|:---|
| $\mathcal{X}, \mathcal{Y}$ | the space of data |
| $P, Q$ | the probability space of data |
| $f_\mathcal{X}(\mathbf{x})$ | the probability density function (PDF) of $\mathbf{x}$ |
| $F_\mathcal{X}(\mathbf{x})$ | the cumulative distribution function (CDF) of $\mathbf{x}$ |
| $F_\mathcal{X}^{-1}(\mathbf{x})$ | the quantile or percent point function (PPF), i.e. the inverse CDF, of $\mathbf{x}$ |
| $p(x;\theta)$ | a probability distribution, $p$, of the variable $x$, parameterized by $\theta$ |
| $p_\theta(x)$ | a probability distribution, $p$, of the variable $x$, parameterized by $\theta$ (succinct version) |
| $p(\mathbf{x};\boldsymbol{\theta})$ | a probability distribution, $p$, of the multidimensional variable, $\mathbf{x}$, parameterized by $\boldsymbol{\theta}$ |
| $p_{\boldsymbol{\theta}}(\mathbf{x})$ | a probability distribution, $p$, of the multidimensional variable, $\mathbf{x}$, parameterized by $\boldsymbol{\theta}$ (succinct version) |
| $\mathcal{N}(x; \mu, \sigma)$ | a normal distribution for $x$ parameterized by $\mu$ and $\sigma$ |
| $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ | a multivariate normal distribution for $\mathbf{x}$ parameterized by $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ |
| $\mathcal{N}(\mathbf{0}, \mathbf{I}_D)$ | a multivariate normal distribution with zero mean and identity covariance |
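
As a concrete example of $f_\mathcal{X}$, $F_\mathcal{X}$, and $F_\mathcal{X}^{-1}$, here is a small sketch using scipy's standard normal distribution (the choice of distribution is illustrative):

```python
from scipy import stats

# Standard normal N(x; μ=0, σ=1) as a concrete example.
dist = stats.norm(loc=0.0, scale=1.0)

density = dist.pdf(0.5)    # f_X(x): probability density at x = 0.5
cum     = dist.cdf(0.5)    # F_X(x): P(X <= 0.5)
quant   = dist.ppf(cum)    # F_X^{-1}: inverse CDF / quantile, recovers 0.5
```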

Information Theory

| Notation | Description |
|:---|:---|
| $I(X)$ | self-information of a random variable $X$ |
| $H(X)$ | entropy of a random variable $X$ |
| $TC(X)$ | total correlation (multi-information) of a random variable $X$ |
| $H(X,Y)$ | joint entropy of random variables $X$ and $Y$ |
| $I(X,Y)$ | mutual information between two random variables $X$ and $Y$ |
| $\text{D}_{\text{KL}}(X,Y)$ | Kullback-Leibler divergence between $X$ and $Y$ |

Gaussian Processes

| Notation | Description |
|:---|:---|
| $\boldsymbol{m}$ | mean function for a Gaussian process |
| $\mathbf{K}$ | kernel function for a Gaussian process |
| $\mathcal{GP}(\boldsymbol{m}, \mathbf{K})$ | Gaussian process distribution parameterized by a mean function, $\boldsymbol{m}$, and kernel matrix, $\mathbf{K}$ |
| $\boldsymbol{\mu}_\mathcal{GP}$ | GP predictive mean function |
| $\boldsymbol{\sigma}^2_\mathcal{GP}$ | GP predictive variance function |
| $\boldsymbol{\Sigma}_\mathcal{GP}$ | GP predictive covariance function |
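
As a concrete illustration of $\boldsymbol{\mu}_\mathcal{GP}$, $\boldsymbol{\sigma}^2_\mathcal{GP}$, and $\boldsymbol{\Sigma}_\mathcal{GP}$, here is a minimal sketch of GP regression with an assumed RBF kernel, zero mean function, and fixed noise variance (the kernel, its hyperparameters, and the toy data are all illustrative, not prescribed by the notation above):

```python
import numpy as np

def rbf_kernel(XA, XB, lengthscale=1.0, variance=1.0):
    # k(x, x') = σ² exp(-||x - x'||² / (2ℓ²))
    sqdist = np.sum(XA**2, 1)[:, None] + np.sum(XB**2, 1)[None, :] - 2 * XA @ XB.T
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def gp_predict(X, y, X_star, noise=1e-2):
    # K: train covariance, K_s: cross covariance, K_ss: test covariance
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    K_s = rbf_kernel(X, X_star)
    K_ss = rbf_kernel(X_star, X_star)
    alpha = np.linalg.solve(K, y)
    mu = K_s.T @ alpha                               # μ_GP
    Sigma = K_ss - K_s.T @ np.linalg.solve(K, K_s)   # Σ_GP
    return mu, np.diag(Sigma), Sigma                 # σ²_GP is the diagonal of Σ_GP

# toy usage: N = 50 samples of a 1D scalar field
X = np.linspace(0, 10, 50)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(50)
mu, var, cov = gp_predict(X, y, np.linspace(0, 10, 100)[:, None])
```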

Field Space

In the first case, we have

$$\mathbf{y} = \boldsymbol{H}(\mathbf{x}) + \epsilon$$

Here the state, $\mathbf{x}$, is a representation of the field:

- $\mathbf{x} \in \mathbb{R}^{D_x}$ - state
- $\boldsymbol{\mu}_{\mathbf{x}} \in \mathbb{R}^{D_x}$ - mean prediction for the state vector
- $\boldsymbol{\sigma}^2_{\mathbf{x}} \in \mathbb{R}^{D_x}$ - variance prediction for the state vector
- $\mathbf{X}_{\boldsymbol{\Sigma}} \in \mathbb{R}^{D_x \times D_x}$ - covariance prediction for the state vector
- $\mathbf{X}_{\boldsymbol{\mu}} \in \mathbb{R}^{N \times D_x}$ - mean predictions for a collection of state vectors

State (Coordinates)

- $\boldsymbol{x} \in \mathbb{R}^{D_\phi}$ - the coordinate vector
- $\boldsymbol{\mu}_{\boldsymbol{x}} \in \mathbb{R}^{D_\phi}$ - mean prediction for the coordinate vector
- $\boldsymbol{\sigma}^2_{\boldsymbol{x}} \in \mathbb{R}^{D_\phi}$ - variance prediction for the coordinate vector
- $\boldsymbol{X}_{\boldsymbol{\Sigma}} \in \mathbb{R}^{D_\phi \times D_\phi}$ - covariance prediction for the coordinate vector
- $\boldsymbol{X}_{\boldsymbol{\mu}} \in \mathbb{R}^{N \times D_\phi}$ - mean predictions for a collection of coordinate vectors

Observations

- $\mathbf{z} \in \mathbb{R}^{D_z}$ - latent domain
- $\mathbf{y} \in \mathbb{R}^{D_y}$ - observations

Matrices

- $\mathbf{Z} \in \mathbb{R}^{N \times D_z}$ - latent domain
- $\mathbf{X} \in \mathbb{R}^{N \times D_x}$ - state
- $\mathbf{Y} \in \mathbb{R}^{N \times D_y}$ - observations


Functions

Coordinates

In this case, we assume that the state, $\mathbf{x} \in \mathbb{R}^{D_\phi}$, is the set of coordinates, $[\text{lat}, \text{lon}, \text{time}]$, and the output is the value of the variable of interest, $\mathbf{y}$, at that point in space and time.

- $[\mathbf{K}]_{ij} = \boldsymbol{k}(\mathbf{x}_i, \mathbf{x}_j)$ - covariance matrix for the coordinates
- $\boldsymbol{k}_{\mathbf{X}}(\mathbf{x}_i) = \boldsymbol{k}(\mathbf{X}, \mathbf{x}_i)$ - cross covariance for the data
- $\boldsymbol{k}(\mathbf{x}_i, \mathbf{x}_j) : \mathbb{R}^{D_\phi} \times \mathbb{R}^{D_\phi} \rightarrow \mathbb{R}$ - the kernel function applied to two vectors
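
A minimal sketch of how this notation maps to code, assuming a simple RBF form for $\boldsymbol{k}(\cdot, \cdot)$ (the kernel choice, lengthscale, and random coordinates are illustrative):

```python
import numpy as np

def kernel(xi, xj, lengthscale=1.0):
    # k(x_i, x_j) : R^{D_φ} × R^{D_φ} -> R  (assumed RBF form)
    return np.exp(-0.5 * np.sum((xi - xj) ** 2) / lengthscale**2)

# X ∈ R^{N×D_φ}: e.g. N = 5 coordinate vectors (lat, lon)
X = np.random.randn(5, 2)

# [K]_{ij} = k(x_i, x_j): covariance matrix for the coordinates
K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # shape (5, 5)

# k_X(x_i) = k(X, x_i): cross covariance between the data and one coordinate
x_i = X[0]
k_X = np.array([kernel(xj, x_i) for xj in X])              # shape (5,)
```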

Data Field

In this case, we assume that the state, $\mathbf{x}$, is the input.

- $[\mathbf{C}]_{ij} = \boldsymbol{c}(\mathbf{x}_i, \mathbf{x}_j)$ - covariance matrix for the data field

Operators


Jacobian

So here, we’re talking about gradients and how they operate on functions.

Scalar Input-Output

$$f: \mathbb{R} \rightarrow \mathbb{R}$$

There are no vectors in this operation so this is simply the derivative.

$$\begin{aligned} J_f &: \mathbb{R} \rightarrow \mathbb{R} \\ J_f(x) &= \frac{df}{dx} \end{aligned}$$

Vector Input, Scalar Output

$$\boldsymbol{f} : \mathbb{R}^D \rightarrow \mathbb{R}$$

This has vector inputs, so the output of the Jacobian operator has the same dimensionality as the input vector.

$$\begin{aligned} \boldsymbol{J}[\boldsymbol{f}](\mathbf{x}) &: \mathbb{R}^{D} \rightarrow \mathbb{R}^D \\ \mathbf{J}_{\boldsymbol{f}}(\mathbf{x}) &= \begin{bmatrix} \frac{\partial f}{\partial x_1} &\cdots &\frac{\partial f}{\partial x_D} \end{bmatrix} \end{aligned}$$

Vector Input, Vector Output

$$\vec{\boldsymbol{f}} : \mathbb{R}^D \rightarrow \mathbb{R}^P$$

The inputs are a vector, $\mathbf{x} \in \mathbb{R}^D$, and the outputs are a vector, $\mathbf{y} \in \mathbb{R}^P$. So the Jacobian operator will produce a matrix of size $\mathbf{J} \in \mathbb{R}^{P \times D}$.

$$\begin{aligned} \boldsymbol{J}[{\boldsymbol{f}}](\mathbf{x}) &: \mathbb{R}^{D} \rightarrow \mathbb{R}^{P\times D}\\ \mathbf{J}[\boldsymbol{f}](\mathbf{x}) &= \begin{bmatrix} \frac{\partial f_1}{\partial x_1} &\cdots &\frac{\partial f_1}{\partial x_D} \\ \vdots &\ddots &\vdots \\ \frac{\partial f_P}{\partial x_1} &\cdots &\frac{\partial f_P}{\partial x_D} \end{bmatrix} \end{aligned}$$
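
A quick sanity check of these shapes using JAX's automatic differentiation on a toy function (the function itself is arbitrary, chosen only to have $D=3$ inputs and $P=2$ outputs):

```python
import jax
import jax.numpy as jnp

def f(x):
    # f : R^3 -> R^2, a toy vector-valued function
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
J = jax.jacfwd(f)(x)    # J ∈ R^{P×D} = R^{2×3}, with [J]_{pj} = ∂f_p/∂x_j
```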

Alternative Forms

I’ve also seen alternative forms which depend on whether the authors want to highlight the inputs or the outputs.

Form I: Highlight the input vectors

$$\mathbf{J}_{\boldsymbol{f}}(\mathbf{x}) = \begin{bmatrix} \frac{\partial \boldsymbol{f}}{\partial x_1} & \cdots & \frac{\partial \boldsymbol{f}}{\partial x_D} \end{bmatrix} = \begin{bmatrix} \frac{\nabla \boldsymbol{f}}{\partial x_1} & \cdots & \frac{\nabla \boldsymbol{f}}{\partial x_D} \end{bmatrix}$$

Form II: Highlights the output vectors

$$\mathbf{J}_{\boldsymbol{f}}(\mathbf{x}) = \begin{bmatrix} \frac{\partial \boldsymbol{f}_1}{\partial \mathbf{x}} \\ \vdots \\ \frac{\partial \boldsymbol{f}_P}{\partial \mathbf{x}} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\nabla}^\top \boldsymbol{f}_1 \\ \vdots \\ \boldsymbol{\nabla}^\top \boldsymbol{f}_P \end{bmatrix}$$

Special Cases

There are probably many special cases where we have closed-form operators but I will highlight one here which comes up in physics a lot.


2D Vector Input, 2D Vector Output

Recall the special case from the vectors above, where the dimensionality of the input vector, $\mathbf{x} \in \mathbb{R}^2$, is the same as the dimensionality of the output vector, $\mathbf{y} \in \mathbb{R}^2$.

$$\boldsymbol{f}:\mathbb{R}^2 \rightarrow \mathbb{R}^2$$

The functional form was:

$$\mathbf{f}\left( \begin{bmatrix} x \\ y \end{bmatrix} \right) = \begin{bmatrix} f_1(x,y) \\ f_2(x,y) \end{bmatrix} = \begin{bmatrix} u \\ v \end{bmatrix}$$

So in this special case, our Jacobian matrix, $\mathbf{J}$, will be:

$$\mathbf{J}_{\boldsymbol{f}}(x,y) = \begin{bmatrix} \frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} \\ \frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} \end{bmatrix}$$

Note: This is a square matrix because the dimension of the input vector, $(x,y)$, matches the dimension of the output vector, $(u,v)$.


Determinant Jacobian

The determinant of the Jacobian is the amount of (volumetric) change. It is given by:

$$\det \boldsymbol{J}_{\boldsymbol{f}}(\mathbf{x}): \mathbb{R}^D \rightarrow \mathbb{R}$$

Notice how we input a vector, $\mathbf{x}$, and the result is a scalar in $\mathbb{R}$.

Note: This can be a very expensive operation, especially with high-dimensional data. Even a naive linear function, $\boldsymbol{f}(\mathbf{x}) = \mathbf{Ax}$, costs $\mathcal{O}(D^3)$. So the name of the game is to look at the Jacobian structure and find tricks to reduce the cost of the calculation.
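
To make the cost concrete, here is a minimal sketch for the linear case, where the Jacobian is just the matrix $\mathbf{A}$ and the (log-)determinant requires an $\mathcal{O}(D^3)$ factorization (the dimension and the random matrix are illustrative):

```python
import numpy as np

D = 500
A = np.random.randn(D, D)

def f(x):
    # linear map f(x) = A x, so J_f(x) = A for every x
    return A @ x

y = f(np.random.randn(D))

# Naive determinant of the Jacobian: an O(D^3) factorization of A.
sign, logabsdet = np.linalg.slogdet(A)   # log|det J_f| is usually what is needed
```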


Special Case: 2D Vector Input, 2D Vector Output

Again, let’s go back to the special case where we have a 2D input vector, $\mathbf{x}\in \mathbb{R}^2$, and a 2D output vector, $\mathbf{y} \in \mathbb{R}^2$. Recall that the Jacobian matrix for the function, $\boldsymbol{f}$, is a $2\times 2$ square matrix. More generally, we can write this as:

$$\boldsymbol{J} \begin{bmatrix} A(x,y) \\ B(x,y) \end{bmatrix} = \begin{bmatrix} \frac{\partial A}{\partial x} & \frac{\partial A}{\partial y} \\ \frac{\partial B}{\partial x} & \frac{\partial B}{\partial y} \end{bmatrix}$$

To calculate the determinant of this Jacobian matrix, we have a closed-form expression. For a general $2\times 2$ matrix, $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$, the determinant is $ad - bc$, so here it is:

$$\det \mathbf{J} = \frac{\partial A}{\partial x}\frac{\partial B}{\partial y} - \frac{\partial A}{\partial y}\frac{\partial B}{\partial x}$$

So if we apply it to our notation

$$\det \mathbf{J}_{\mathbf{f}}(x,y) = \frac{\partial f_1}{\partial x}\frac{\partial f_2}{\partial y} - \frac{\partial f_1}{\partial y}\frac{\partial f_2}{\partial x}$$

This is probably the easiest determinant Jacobian to calculate (apart from the scalar-valued case, which is simply the derivative), and it comes up from time to time in physics.
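
A small sketch that checks this closed form against the determinant of the full autodiff Jacobian, for an arbitrary toy field (the choice of $f_1$ and $f_2$ is purely illustrative):

```python
import jax
import jax.numpy as jnp

def f(xy):
    # toy field with f1(x, y) = u and f2(x, y) = v
    x, y = xy
    return jnp.array([x * y, jnp.sin(x) + y**2])

def det_jacobian(xy):
    # closed form: ∂f1/∂x ∂f2/∂y − ∂f1/∂y ∂f2/∂x, with autodiff for the partials
    df1 = jax.grad(lambda p: f(p)[0])(xy)
    df2 = jax.grad(lambda p: f(p)[1])(xy)
    return df1[0] * df2[1] - df1[1] * df2[0]

xy = jnp.array([0.3, -1.2])
# agrees with the determinant of the full 2×2 Jacobian matrix
full = jnp.linalg.det(jax.jacfwd(f)(xy))
assert jnp.allclose(det_jacobian(xy), full)
```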

Note: I have seen an alternative form in the geoscience literature, $\boldsymbol{J}(\boldsymbol{f}_1, \boldsymbol{f}_2)$. I personally don’t like this notation because in no way does it specify the determinant. I propose a better, clearer notation: $\det \boldsymbol{J}(\boldsymbol{f}_1, \boldsymbol{f}_2)$. Now we at least have an explicit reminder that this quantity is a determinant.


Example: This appears in the quasi-geostrophic (QG) PDE, which is given by:

$$\partial_t q + \boldsymbol{J}(\psi, q) = 0$$

where the Jacobian operator is given by:

$$\boldsymbol{J}(\psi, q) = \partial_x \psi\, \partial_y q - \partial_y \psi\, \partial_x q$$

With my updated notation, this would now be:

$$\partial_t q + \det\boldsymbol{J}(\psi, q) = 0$$

where the determinant Jacobian operator is given by:

$$\det\boldsymbol{J}(\psi, q) = \partial_x \psi\, \partial_y q - \partial_y \psi\, \partial_x q$$

In my eyes, this is clearer, especially in papers where people recycle the equations without explicitly defining the operators and their meaning.
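
For completeness, here is a hedged sketch of this determinant Jacobian operator on a gridded field, using periodic central differences (the grid layout, spacing, and differencing scheme are assumptions for illustration; real QG solvers often use more careful discretizations such as the Arakawa Jacobian):

```python
import numpy as np

def det_jacobian(psi, q, dx=1.0, dy=1.0):
    # det J(ψ, q) = ∂x ψ ∂y q − ∂y ψ ∂x q, with periodic central differences
    # assumes arrays indexed as (y, x): axis 0 is y, axis 1 is x
    dpsi_dx = (np.roll(psi, -1, axis=1) - np.roll(psi, 1, axis=1)) / (2 * dx)
    dpsi_dy = (np.roll(psi, -1, axis=0) - np.roll(psi, 1, axis=0)) / (2 * dy)
    dq_dx   = (np.roll(q,   -1, axis=1) - np.roll(q,   1, axis=1)) / (2 * dx)
    dq_dy   = (np.roll(q,   -1, axis=0) - np.roll(q,   1, axis=0)) / (2 * dy)
    return dpsi_dx * dq_dy - dpsi_dy * dq_dx

# toy usage on a 64×64 periodic grid
psi = np.random.randn(64, 64)   # streamfunction ψ
q = np.random.randn(64, 64)     # potential vorticity q
advection = det_jacobian(psi, q, dx=2 * np.pi / 64, dy=2 * np.pi / 64)
```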