Temporal Dependencies ¶
In this section, we outline some of the ways to construct parametric functions whose parameters have temporal dependencies.
$$
\theta(t) = \theta_0 + \theta_1\phi(t) + \epsilon_\theta
$$
where $\phi(t)$ is some function which encodes time, e.g., linear, log-linear, Fourier, etc.
Encoding Time ¶
$$
\phi = \phi(t), \hspace{10mm} \phi: \mathbb{R}^+ \rightarrow \mathbb{R}^{D_t}
$$
The most general form would be a Fourier series
$$
\phi(t) = \sum_{k=1}^K a_k \sin\left(\frac{2\pi k t}{P}\right) + b_k \cos\left(\frac{2\pi k t}{P}\right)
$$
where $t$ is time, $P$ is the base period of the feature, and $k$ is the index of the series.
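As a sketch, the Fourier encoding can be implemented as a feature map whose outputs are later weighted by the learnable coefficients $a_k, b_k$; the choices of `K` and `P` below are illustrative, not from the text.

```python
import numpy as np

def fourier_features(t, K=3, P=365.25):
    """Encode times t as 2K Fourier features with base period P."""
    t = np.asarray(t, dtype=float)
    k = np.arange(1, K + 1)                              # harmonics 1..K
    angles = 2.0 * np.pi * k[None, :] * t[:, None] / P
    # stack sine and cosine features: phi maps R^+ into R^{2K}
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

phi = fourier_features(np.array([0.0, 10.0, 100.0]))
```

A linear model $\theta(t) = \theta_0 + \boldsymbol{\theta}_1^\top\phi(t)$ then absorbs the $a_k, b_k$ into its weights.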
We can make this more expressive by adding an envelope, $\tilde{F}_n(t)$:
$$
\phi(t) = \sum_{k=1}^K a_k \tilde{F}_n(t)\sin\left(\frac{2\pi k t}{P}\right) + b_k \tilde{F}_n(t)\cos\left(\frac{2\pi k t}{P}\right)
$$
Constant ¶
We can use maximum likelihood to estimate the unknown parameters of the parametric distributions.
$$
L(\boldsymbol{\theta}) = \prod_{n=1}^{N_T} f(y_n) \prod_{n=1}^{N_T} S(y_n)
$$
We can rewrite this as a log-likelihood
$$
\log L(\boldsymbol{\theta}) = \sum_{n=1}^{N_T} \log f(y_n) - \sum_{n=1}^{N_T} F(y_n)
$$
where $f$ is the PDF and $F$ is the cumulative hazard function, so that $\log S(y) = -F(y)$.
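As a sketch of the maximum-likelihood step, assume the constant-hazard (exponential) case, where the cumulative hazard is $F(y) = \lambda y$; the data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=5_000)   # synthetic durations, true rate 0.5

def neg_log_likelihood(lam, y):
    # -(sum of log-hazards minus sum of cumulative hazards F(y) = lam * y)
    return -(len(y) * np.log(lam) - lam * y.sum())

# the constant-hazard maximizer has a closed form: lam_hat = N / sum(y)
lam_hat = len(y) / y.sum()
```

For richer parametric families the same objective is minimized numerically rather than in closed form.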
Parametric ¶
$$
\lambda^*(t,\mathbf{s}) = \boldsymbol{\mu}(t,\mathbf{s}) + \boldsymbol{g}(t,\mathbf{s})
$$
Typically, we can use some kernel function
$$
f(t|\boldsymbol{\theta}) = \sum_{n=1}^{N_s} w_n \, \boldsymbol{k}(t,\boldsymbol{\theta}_n)
$$
where $\boldsymbol{k}$ is a kernel function (a density function) with parameters $\boldsymbol{\theta}_n$.
The $w_n$ are weights corresponding to the densities $\boldsymbol{k}(t,\boldsymbol{\theta}_n)$, with $\sum_{n=1}^{N_s} w_n = 1$.
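A minimal sketch of such a kernel mixture, using Gaussian kernels; the weights and kernel parameters are made up for illustration.

```python
import numpy as np

# illustrative mixture of N_s = 2 Gaussian kernels (all parameters made up)
weights = np.array([0.3, 0.7])                   # sum to 1
means = np.array([0.0, 5.0])
scales = np.array([1.0, 2.0])

def gauss_pdf(t, mu, sigma):
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_density(t):
    # f(t | theta) = sum_n w_n * k(t, theta_n)
    return float(np.sum(weights * gauss_pdf(t, means, scales)))

# because the weights sum to 1, the mixture integrates to 1
grid = np.linspace(-20.0, 30.0, 2_001)
total = float(np.sum([mixture_density(t) for t in grid]) * (grid[1] - grid[0]))
```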
Kernel ¶
A Hawkes process has a background rate and a self-exciting term; the self-exciting term makes further events more likely given past events.
$$
\boldsymbol{g}(t,\mathbf{s}) = g_t(t)\,\boldsymbol{g}_s(\mathbf{s})
$$
For this paper, they used a separable, parametric kernel function, i.e., a Hawkes kernel.
In particular, they used a Gaussian kernel in space and an exponential kernel in time.
$$
\boldsymbol{g}(t,\mathbf{s}) =
\alpha\beta\exp(-\beta t)\,
\frac{1}{\sqrt{2\pi|\mathbf{\Sigma}|}}
\exp\left(-\mathbf{s}^\top\mathbf{\Sigma}^{-1}\mathbf{s}\right)
$$
where $\alpha,\beta>0$ and $\mathbf{\Sigma}$ is a PSD matrix.
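A sketch of this separable kernel exactly as written above; the parameter values are illustrative.

```python
import numpy as np

alpha, beta = 0.8, 1.5                            # illustrative excitation parameters
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])        # PSD spatial covariance
Sigma_inv = np.linalg.inv(Sigma)
norm_const = 1.0 / np.sqrt(2.0 * np.pi * np.linalg.det(Sigma))

def g_t(t):
    # exponential decay in time
    return alpha * beta * np.exp(-beta * t)

def g_s(s):
    # Gaussian-shaped kernel in space, as written in the text
    return norm_const * np.exp(-s @ Sigma_inv @ s)

def g(t, s):
    # separable kernel g(t, s) = g_t(t) * g_s(s)
    return g_t(t) * g_s(s)
```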
Doubly Stochastic ¶ This is also known as a Log-Gaussian Cox process.
$$
\log \boldsymbol{\mu}(t,\mathbf{s}) = \boldsymbol{f}_t(t) + \boldsymbol{f}_s(\mathbf{s})
$$
where $\boldsymbol{f}$ is a realization from a Gaussian process
$$
\boldsymbol{f} \sim \mathcal{GP}(\boldsymbol{m}_\theta, \boldsymbol{k}_\theta)
$$
If we assume that our hazard function is a constant function,
$$
h(t) = \lambda
$$
this implies that our survival function is
$$
S(t) = \exp(-\lambda t)
$$
Approximating Integrals ¶
Let's say we are given some pixel locations
$$
\mathcal{H} = \left\{\mathbf{s}_n\right\}_{n=1}^N
$$
Monte Carlo ¶
$$
\int f(t)\,dt \approx \sum_{n=1}^N \Delta t_n \left[\frac{1}{K}\sum_{k=1}^K f(t_n + \Delta t_n z_k)\right], \hspace{10mm} z_k \sim U[0,1]
$$
ODE Time Steppers ¶
We can write an equation of motion (EoM) describing the state transitions.
$$
\partial_t \lambda = \boldsymbol{f}_\theta(\lambda, t), \hspace{10mm} \boldsymbol{f}: \mathbb{R}^{D_\lambda}\times\mathbb{R}^+ \rightarrow \mathbb{R}^{D_\lambda}
$$
Now, we can integrate this in time via the fundamental theorem of calculus.
$$
\lambda(t) = \lambda_0 + \int_0^t \boldsymbol{f}_\theta(\lambda(\tau), \tau)\,d\tau
$$
In practice, we know that we have to make some numerical approximations, e.g., Euler (Taylor expansion) or RK4 (quadrature).
This leaves us with a time stepper function
λ t 1 : = ϕ t ( λ ) = λ t 0 + ∫ t 0 t 1 f θ ( λ t 0 , τ ) d τ \lambda_{t_1} := \boldsymbol{\phi}_t(\boldsymbol{\lambda}) = \lambda_{t_0} + \int_{t_0}^{t_1}\boldsymbol{f_\theta}(\lambda_{t_0}, \tau)d\tau λ t 1 := ϕ t ( λ ) = λ t 0 + ∫ t 0 t 1 f θ ( λ t 0 , τ ) d τ which we can apply in an autoregressive fashion
$$
\boldsymbol{\lambda}_T =
\boldsymbol{\phi}_{t_T} \circ \boldsymbol{\phi}_{t_{T-1}} \circ
\ldots \circ
\boldsymbol{\phi}_{t_1} \circ
\boldsymbol{\phi}_{t_0}(\boldsymbol{\lambda}_0)
$$
This is more commonly represented by specifying a vector of time points we wish to receive from the ODESolver, $\mathbf{t}=[t_0, t_1, \ldots, T]$, and passing this through the ODESolver
$$
\boldsymbol{\lambda} = \text{ODESolver}\left(\boldsymbol{f}, \lambda_0, \mathbf{t}, \boldsymbol{\theta}\right)
$$
where $\boldsymbol{\lambda}=[\lambda_0, \lambda_1, \ldots, \lambda_T]$.
This is a nice abstraction which allows the user to choose an arbitrary solver such as Euler, Heun, or Runge-Kutta.
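A minimal sketch of this abstraction with an explicit Euler stepper; the solver and right-hand side are illustrative, not from the text.

```python
import numpy as np

def euler_step(f, lam, t0, t1):
    # one explicit Euler step: approximate the integral of f over [t0, t1]
    return lam + (t1 - t0) * f(lam, t0)

def ode_solver(f, lam0, ts):
    # apply the time stepper autoregressively over the requested time points
    lams = [lam0]
    for t0, t1 in zip(ts[:-1], ts[1:]):
        lams.append(euler_step(f, lams[-1], t0, t1))
    return np.array(lams)

# d(lambda)/dt = -lambda has the exact solution lambda(t) = exp(-t)
ts = np.linspace(0.0, 1.0, 1_001)
traj = ode_solver(lambda lam, t: -lam, 1.0, ts)
```

Swapping `euler_step` for Heun or RK4 changes only the stepper, not the autoregressive composition.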
Quadrature ¶
$$
\int_{\Omega}\boldsymbol{\lambda}(\mathbf{s})\,d\mathbf{s} \approx
\sum_{n=1}^{N_s} w_n \boldsymbol{\lambda}(\mathbf{s}_n)
$$
where $w_n>0$, $\sum_{n=1}^{N_s} w_n = |\Omega|$, and $\mathbf{s}_n$, $n = 1,2,\ldots,N_s$, are all of the points in the domain, $\Omega$.
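A sketch of this quadrature rule using midpoint weights on the unit square, where the weights sum to $|\Omega| = 1$; the intensity function is illustrative.

```python
import numpy as np

# midpoint rule on Omega = [0, 1]^2: N_s equal-area cells, w_n = |Omega| / N_s
n_side = 50
centers = (np.arange(n_side) + 0.5) / n_side
sx, sy = np.meshgrid(centers, centers)
s = np.stack([sx.ravel(), sy.ravel()], axis=-1)   # quadrature points s_n
w = np.full(len(s), 1.0 / len(s))                 # weights sum to |Omega| = 1

def lam(s):
    # illustrative intensity; its exact integral over the unit square is 1
    return 4.0 * s[:, 0] * s[:, 1]

integral = float(np.sum(w * lam(s)))
```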
This yields an approximation to the log-likelihood as
$$
\log p(\Omega) \approx
\sum_{n=1}^{N}\log \boldsymbol{\lambda}(\mathbf{s}_n)
- \sum_{n=1}^{N_s} w_n \boldsymbol{\lambda}(\mathbf{s}_n)
$$
We have 3 sets of points:
- $N_s$: the points within the discretized domain.
- $N(\mathbf{s}_n)$: the points within the discretized domain where we have events.
- $\bar{N}_s = N_s - N(\mathbf{s}_n)$: the points in the discretized domain where there are no events.

Berman & Turner (1992) used the names design points, data points, and dummy points, respectively.
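A 1-D sketch of these three point sets and the resulting weighted log-likelihood in the Berman and Turner style; the domain, events, and constant intensity are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# design points: N_s equally spaced points on [0, 1], weights sum to |Omega| = 1
N_s = 100
s = (np.arange(N_s) + 0.5) / N_s
w = np.full(N_s, 1.0 / N_s)

# data points: design points whose cell contains at least one synthetic event
events = rng.uniform(size=8)
counts = np.histogram(events, bins=np.linspace(0.0, 1.0, N_s + 1))[0]
m = (counts > 0).astype(float)                 # mask over design points
n_data, n_dummy = int(m.sum()), int((1 - m).sum())

# weighted log-likelihood: sum_n w_n * (m_n / w_n * log(lam_n) - lam_n)
lam = np.full(N_s, 8.0)                        # illustrative constant intensity
log_p = float(np.sum(w * (m / w * np.log(lam) - lam)))
```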
Let's create a mask vector which represents whether or not we observe an event within an element.
$$
\mathbf{m}_n =
\begin{cases}
1, && N(\mathbf{s}_n) > 0 \\
0, && N(\mathbf{s}_n) = 0
\end{cases}
$$
Now, we can rewrite the log-likelihood term as
$$
\log p(\Omega) \approx
\sum_{n=1}^{N_s}
\mathbf{w}_n
\left(
\frac{\mathbf{m}_n}{\mathbf{w}_n}
\log\boldsymbol{\lambda}(\mathbf{s}_n) -
\boldsymbol{\lambda}(\mathbf{s}_n)
\right)
$$
Pixel Counts ¶
We can divide the domain, $\Omega$, into small pixels, each of area $\omega_n$.
Then the integral over the domain, $\Omega$, is approximated by summing over all of the pixels.
$$
\int_{\Omega}\lambda(\mathbf{s})\,d\mathbf{s} \approx
\sum_{n=1}^{N_s}\omega_n\lambda(\mathbf{s}_n)
$$
where $\mathbf{s}_n$ is the center of the $n$th pixel.
In this case, we discard the exact locations of the data points and mark each data point with the center of its pixel.
To approximate the sum over data points, we can take the sum over pixels.
$$
\sum_{n=1}^{N}\log\lambda(\mathbf{s}_n) \approx
\sum_{n=1}^{N_s} N(\mathbf{s}_n)\log \lambda(\mathbf{s}_n)
$$
where $N(\mathbf{s}_n)$ is the number of data points falling into the $n$th pixel.
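A sketch comparing the exact sum over data points with its pixel-binned version; the events and intensity below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
events = rng.uniform(size=50)                     # synthetic event locations in [0, 1]

n_pix = 20
edges = np.linspace(0.0, 1.0, n_pix + 1)
centers = 0.5 * (edges[:-1] + edges[1:])          # pixel centers s_n
N_counts = np.histogram(events, bins=edges)[0]    # N(s_n): events per pixel

def lam(s):
    return 1.0 + s                                # illustrative smooth intensity

exact = float(np.sum(np.log(lam(events))))        # sum over data points
binned = float(np.sum(N_counts * np.log(lam(centers))))  # sum over pixels
```

The binning error shrinks with the pixel width, since each event moves at most half a pixel to its center.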
So we can take these two quantities and put them together to get
$$
\log p(\Omega) \approx
\sum_{n=1}^{N_s}
\left(
N(\mathbf{s}_n)\log \lambda(\mathbf{s}_n)
-
\omega_n\lambda(\mathbf{s}_n)
\right)
$$
Masks ¶
We can make this a bit neater by introducing a masking variable, $\boldsymbol{m}$.
This variable acts as an indicator variable
$$
m_n =
\begin{cases}
1, && y > y_0 \\
0, && y \leq y_0
\end{cases}
$$
Now, we can use this indicator variable to mask the likelihood contribution to zero if the observed value at the temporal resolution is below the threshold, $y_0$.
$$
\begin{aligned}
\log p(y_n|\boldsymbol{\theta}) &=
\sum_{n=1}^{N(A)} m_n \log \boldsymbol{f}_{GEVD}(y_n;\boldsymbol{\theta}) \\
&- T_{years}\,\mathbf{S}_{GEVD}(y_0;\boldsymbol{\theta}) \\
&+ \log \sum_{n=1}^{N(A)}\boldsymbol{f}_{GPD}(y_n;\boldsymbol{\theta})
\end{aligned}
$$
Demo Code ¶
For the spatiotemporal case:
```python
# initialize the event times
T: Array["T"] = ...
# initialize the spatial event points
S: Array["T S"] = ...
# initialize the marks at the events in space and time
Y: Array["T S Dy"] = ...
# initialize the mask at the marks
M: Array["T S Dy"] = ...
```