## Trade-offs

### Pros

- Mesh-free
- Lots of data

### Cons

- Transfer learning
## Data

$$
\mathbf{x}_\phi \in \mathbb{R}^{D_\phi}, \qquad \mathbf{u} \in \mathbb{R}
$$

## Model

$$
\boldsymbol{f_\theta}: \mathcal{X} \rightarrow \mathcal{U}
$$

## Architectures

We are interested in the case of regression. We have the following generalized architecture:
$$
\begin{aligned}
\mathbf{x}^{(1)} &= \boldsymbol{\phi}\left(\mathbf{x}; \boldsymbol{\gamma}\right) \\
\mathbf{x}^{(\ell+1)} &= \text{NN}_\ell\left(\mathbf{x}^{(\ell)}; \boldsymbol{\theta}_\ell\right) \\
\boldsymbol{f}(\mathbf{x}; \boldsymbol{\theta}, \boldsymbol{\gamma}) &= \mathbf{w}^{(L)}\mathbf{x}^{(L)} + \mathbf{b}^{(L)}
\end{aligned}
$$

where $\boldsymbol{\phi}$ is the basis transformation with hyperparameters $\boldsymbol{\gamma}$, $\text{NN}_\ell$ is a neural network layer parameterized by $\boldsymbol{\theta}_\ell$, and we have $L$ layers, $\ell \in \{1, 2, \ldots, L\}$.
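As a concrete illustration, the generalized architecture above can be sketched in a few lines of NumPy. All layer sizes, the identity basis function, and the tanh activation are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # Basis transformation phi(x; gamma); identity as a placeholder choice.
    return x

def nn_layer(x, w, b):
    # One hidden layer NN_l: linear map followed by a nonlinearity.
    return np.tanh(w @ x + b)

def f(x, params):
    # x^{(1)} = phi(x); x^{(l+1)} = NN_l(x^{(l)}); the final layer is linear.
    z = phi(x)
    for w, b in params[:-1]:
        z = nn_layer(z, w, b)
    w_L, b_L = params[-1]
    return w_L @ z + b_L

# Illustrative sizes: D_phi = 3 inputs, two hidden layers of width 8, scalar output.
sizes = [3, 8, 8, 1]
params = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]

u = f(rng.normal(size=3), params)
```
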
### Standard Neural Network

In the standard neural network, we typically have the following functions:

$$
\begin{aligned}
\boldsymbol{\phi}(\mathbf{x}) &= \mathbf{x} \\
\text{NN}\left(\mathbf{x}^{(\ell)}; \boldsymbol{\theta}\right) &= \boldsymbol{\sigma}\left(\mathbf{w}^{(\ell)}\mathbf{x}^{(\ell)} + \mathbf{b}^{(\ell)}\right), \qquad \boldsymbol{\theta} = \{\mathbf{w}^{(\ell)}, \mathbf{b}^{(\ell)}\}
\end{aligned}
$$

So more explicitly, we can write it as:
$$
\begin{aligned}
\mathbf{x}^{(1)} &= \mathbf{x} \\
\boldsymbol{f}^{(\ell)}(\mathbf{x}^{(\ell)}) &= \boldsymbol{\sigma}\left(\mathbf{w}^{(\ell)}\mathbf{x}^{(\ell)} + \mathbf{b}^{(\ell)}\right) \\
\boldsymbol{f}^{(L)}(\mathbf{x}^{(L)}) &= \mathbf{w}^{(L)}\mathbf{x}^{(L)} + \mathbf{b}^{(L)}
\end{aligned}
$$

where $\ell \in \{1, 2, \ldots, L-1\}$.
Notably:

- The first layer is the identity (i.e. there is no basis function transformation).
- The intermediate layers follow the standard neural network architecture, i.e. a linear function followed by a nonlinear activation function.
- The final layer is always a linear function (in regression; classification would add a sigmoid).

### Positional Encoding

#### Fourier Features

$$
\boldsymbol{\phi}\left(\mathbf{x}\right) =
\begin{bmatrix}
\sin\left(\boldsymbol{\omega}\mathbf{x}\right) \\
\cos\left(\boldsymbol{\omega}\mathbf{x}\right)
\end{bmatrix}, \qquad \boldsymbol{\omega} \sim p(\boldsymbol{\omega}; \gamma)
$$

| Kernel | Frequency Distribution $p(\boldsymbol{\omega})$ |
|---|---|
| Gaussian | $\mathcal{N}(\mathbf{0}, \frac{1}{\sigma^2}\mathbf{I}_r)$ |
| Laplacian | $\text{Cauchy}()$ |
| Cauchy | $\text{Laplace}()$ |
| Matern | $\text{Bessel}()$ |
| ArcCosine | — |
$$
\boldsymbol{\phi}(\mathbf{x}) = \sqrt{\frac{2}{D_{rff}}}\cos\left(\boldsymbol{\omega}\mathbf{x} + \boldsymbol{b}\right)
$$

where $\boldsymbol{\omega} \sim p(\boldsymbol{\omega})$ and $\boldsymbol{b} \sim \mathcal{U}(0, 2\pi)$.
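A minimal sketch of the random Fourier feature map above, assuming a Gaussian frequency distribution; the feature count and length-scale are illustrative:

```python
import numpy as np

def random_fourier_features(x, n_features=16, length_scale=1.0, seed=0):
    # phi(x) = sqrt(2 / D_rff) * cos(omega @ x + b),
    # with omega ~ N(0, 1/length_scale^2 I) and b ~ U(0, 2*pi).
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    omega = rng.normal(scale=1.0 / length_scale, size=(n_features, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ omega.T + b)

x = np.array([[0.1, 0.2], [0.3, 0.4]])
feats = random_fourier_features(x)
print(feats.shape)  # (2, 16)
```

Inner products of these features approximate the Gaussian kernel, which is why the frequency distribution in the table above determines the implied kernel.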
**Sources**:

- Random Features for Large-Scale Kernel Machines - Rahimi & Recht (2008) - Paper
- Random Features for Kernel Approximation: A Survey on Algorithms, Theory, and Beyond - Liu et al (2021)
- Scalable Kernel Methods via Doubly Stochastic Gradients - Dai et al (2015)
- Blog - Gregory Gundersen

### SIREN

$$
\boldsymbol{\sigma} = \sin\left(\omega_0(\mathbf{wx} + b)\right)
$$

A modulated variant scales the sinusoid feature-wise,

$$
\boldsymbol{\sigma} = \boldsymbol{\alpha} \odot \sin\left(\mathbf{wx} + \mathbf{b}\right)
$$

which mirrors the FiLM layer:

$$
\text{FiLM}(\mathbf{x}) = \boldsymbol{\alpha} \odot \mathbf{x} + \boldsymbol{\beta}
$$

- pi-GAN: Periodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis - Chan et al (2021)

#### COIN++: Extended

$$
\boldsymbol{\sigma} = \sin\left(\boldsymbol{\gamma}(\mathbf{wx} + b) + \boldsymbol{\beta}\right)
$$

where $\boldsymbol{\gamma}$ corresponds to the frequencies and $\boldsymbol{\beta}$ corresponds to the phase shifts.
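These sinusoidal activations can be sketched side by side. The weights here are random placeholders; note that setting $\boldsymbol{\gamma} = \mathbf{1}$ and $\boldsymbol{\beta} = \mathbf{0}$ in the extended form recovers a plain sine layer:

```python
import numpy as np

def siren_layer(x, w, b, omega_0=30.0):
    # SIREN: sigma = sin(omega_0 * (w x + b)); omega_0 = 30 is a common choice.
    return np.sin(omega_0 * (w @ x + b))

def coinpp_layer(x, w, b, gamma, beta):
    # COIN++-style extension: sigma = sin(gamma * (w x + b) + beta),
    # where gamma are frequencies and beta are phase shifts.
    return np.sin(gamma * (w @ x + b) + beta)

rng = np.random.default_rng(0)
w, b = rng.normal(size=(4, 2)), rng.normal(size=4)
x = np.array([0.5, -0.5])
out = coinpp_layer(x, w, b, gamma=np.ones(4), beta=np.zeros(4))
```
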
### Modulation

Modulation is defined as:

$$
\boldsymbol{f}^\ell(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta}) := \boldsymbol{h}_M^\ell\left(\;\text{NN}(\mathbf{x}; \boldsymbol{\theta}_{NN})\;,\; \text{M}(\mathbf{z}; \boldsymbol{\theta}_{M})\;\right)
$$

where $\text{NN}$ is the output of the neural network wrt the input $\mathbf{x}$, $\text{M}$ is the output of the modulation function wrt the latent variable $\mathbf{z}$, and $\boldsymbol{h}_M$ is an arbitrary combination operator.
Variants:

- Additive layer
- Affine layer
- Neural Implicit Flows
- Neural Flows

**References**: FiLM, 2020; Mehta, 2021; Dupont, 2022; Neural Implicit Flows - Pan, 2022.
#### Affine Modulation

$$
\begin{aligned}
\mathbf{z}^{(1)} &= \mathbf{x} \\
\mathbf{z}^{(k+1)} &= \boldsymbol{\sigma}\left(\left(\mathbf{w}^{(k)}\mathbf{z}^{(k)} + \mathbf{b}^{(k)}\right) \odot \boldsymbol{s}_m(\mathbf{z}) + \boldsymbol{a}_m(\mathbf{z})\right) \\
\boldsymbol{f}(\mathbf{x}) &= \mathbf{w}^{(K)}\mathbf{z}^{(K)} + \mathbf{b}^{(K)}
\end{aligned}
$$

where $\boldsymbol{s}_m$ and $\boldsymbol{a}_m$ are the scale and shift modulations of the latent variable $\mathbf{z}$.

#### Shift Modulation

Shift modulation keeps only the additive term $\boldsymbol{a}_m(\mathbf{z})$:

$$
\mathbf{z}^{(k+1)} = \boldsymbol{\sigma}\left(\mathbf{w}^{(k)}\mathbf{z}^{(k)} + \mathbf{b}^{(k)} + \boldsymbol{a}_m(\mathbf{z})\right)
$$
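A sketch of one affine-modulated layer. The scale and shift networks $\boldsymbol{s}_m$, $\boldsymbol{a}_m$ are replaced by simple linear maps of the latent, and tanh stands in for the activation; both are illustrative assumptions:

```python
import numpy as np

def affine_modulated_layer(z_k, latent, w, b, w_s, w_a):
    # Scale and shift modulations derived from the latent variable z.
    s_m = w_s @ latent  # s_m(z): per-feature scale
    a_m = w_a @ latent  # a_m(z): per-feature shift
    # sigma((w z^{(k)} + b) * s_m(z) + a_m(z)); tanh as a placeholder activation.
    # Fixing s_m to ones would recover shift-only modulation.
    return np.tanh((w @ z_k + b) * s_m + a_m)

rng = np.random.default_rng(0)
z_k = rng.normal(size=3)
latent = rng.normal(size=2)
w, b = rng.normal(size=(4, 3)), rng.normal(size=4)
w_s, w_a = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
out = affine_modulated_layer(z_k, latent, w, b, w_s, w_a)
print(out.shape)  # (4,)
```
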
#### Neural Implicit Flows

In this work, we have a version of the modulated SIREN as mentioned above. However, the space and time neural networks are separated:

$$
\boldsymbol{f}(\mathbf{x}_\phi, t) = \text{NN}_{space}(\mathbf{x}_\phi; \text{NN}_{time}(t))
$$

### Multiplicative Filter Networks

$$
\begin{aligned}
\mathbf{z}^{(1)} &= \boldsymbol{g}^{(1)}\left(\mathbf{x}; \boldsymbol{\theta}^{(1)}\right) \\
\mathbf{z}^{(k+1)} &= \boldsymbol{g}^{(k+1)}\left(\mathbf{x}; \boldsymbol{\theta}^{(k+1)}\right) \odot \left(\mathbf{w}^{(k)}\mathbf{z}^{(k)} + \mathbf{b}^{(k)}\right) \\
\boldsymbol{f}(\mathbf{x}) &= \mathbf{w}^{(K)}\mathbf{z}^{(K)} + \mathbf{b}^{(K)}
\end{aligned}
$$

where $k \in \{1, 2, \ldots, K-1\}$ and $\boldsymbol{g}$ is a nonlinear filter (see below).
#### Non-Linear Functions

**FourierNet**

This method corresponds to the random Fourier feature transformation:

$$
\boldsymbol{g}^{(\ell)}(\mathbf{x}; \boldsymbol{\theta}^{(\ell)}) = \sin\left(\mathbf{w}^{(\ell)}\mathbf{x} + \mathbf{b}^{(\ell)}\right)
$$

where the parameters to be learned are:

$$
\boldsymbol{\theta}^{(\ell)} = \{\mathbf{w}_d^{(\ell)}, \; \mathbf{b}_d^{(\ell)}\}
$$

**GaborNet**
This method tries to improve upon the Fourier representation, which has global support and therefore has more difficulty representing local features. The Gabor filter (below) can capture both frequency and spatial locality:

$$
\boldsymbol{g}^{(\ell)}(\mathbf{x}; \boldsymbol{\theta}^{(\ell)}) = \exp\left(-\frac{\gamma_d^{(\ell)}}{2}\left|\left|\mathbf{x} - \boldsymbol{\mu}_d^{(\ell)}\right|\right|_2^2\right) \odot \sin\left(\mathbf{w}^{(\ell)}\mathbf{x} + \mathbf{b}^{(\ell)}\right)
$$

where the parameters to be learned are:

$$
\boldsymbol{\theta}^{(\ell)} = \{\gamma_d^{(\ell)} \in \mathbb{R}, \; \boldsymbol{\mu}_d^{(\ell)}, \; \mathbf{w}_d^{(\ell)}, \; \mathbf{b}_d^{(\ell)}\}
$$
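Putting the pieces together, here is a sketch of the multiplicative recursion $\mathbf{z}^{(k+1)} = \boldsymbol{g}^{(k+1)}(\mathbf{x}) \odot (\mathbf{w}^{(k)}\mathbf{z}^{(k)} + \mathbf{b}^{(k)})$ with Gabor filters; the sizes and random initialization are illustrative, not a tuned implementation:

```python
import numpy as np

def gabor_filter(x, w, b, mu, gamma):
    # g(x) = exp(-gamma_d/2 * ||x - mu_d||^2) * sin(w x + b)
    # x: (n, d_in); w, mu: (h, d_in); b, gamma: (h,)
    sq = np.sum((x[:, None, :] - mu[None, :, :]) ** 2, axis=-1)  # (n, h)
    return np.exp(-0.5 * gamma * sq) * np.sin(x @ w.T + b)

def mfn(x, filters, linears):
    # z^{(1)} = g^{(1)}(x); z^{(k+1)} = g^{(k+1)}(x) * (w^{(k)} z^{(k)} + b^{(k)});
    # f(x) = w^{(K)} z^{(K)} + b^{(K)}.
    z = gabor_filter(x, *filters[0])
    for (w, b), filt in zip(linears[:-1], filters[1:]):
        z = gabor_filter(x, *filt) * (z @ w.T + b)
    w_K, b_K = linears[-1]
    return z @ w_K.T + b_K

rng = np.random.default_rng(0)
d_in, h, K = 2, 8, 3
filters = [(rng.normal(size=(h, d_in)), rng.normal(size=h),
            rng.normal(size=(h, d_in)), np.abs(rng.normal(size=h)))
           for _ in range(K)]
linears = [(rng.normal(size=(h, h)), rng.normal(size=h)) for _ in range(K - 1)]
linears.append((rng.normal(size=(1, h)), rng.normal(size=1)))

x = rng.normal(size=(5, d_in))
y = mfn(x, filters, linears)
print(y.shape)  # (5, 1)
```

Swapping `gabor_filter` for a plain sine of the input recovers the FourierNet variant.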
## Probabilistic

### Deterministic

$$
\boldsymbol{\theta}^* = \underset{\boldsymbol{\theta}}{\text{argmin}}\; \mathcal{L}(\boldsymbol{\theta}), \qquad \mathcal{L}(\boldsymbol{\theta}) = \lambda \sum_{n \in \mathcal{D}} ||\boldsymbol{f}(\mathbf{x}_n; \boldsymbol{\theta}) - \boldsymbol{u}_n||_2^2 - \log p(\boldsymbol{\theta})
$$

### Normalizing Flows

### Bayesian

- Random Feature Expansions (RFEs)

## Physics Constraints

### Mass

### Momentum

### QG Equations

## Applications

### Interpolation

### Surrogate Modeling

### Sampling

## Feature Engineering

$$
\mathbf{x} \in \mathbb{R}^{D_\phi}, \qquad D = \{\text{lat, lon, time}\}
$$

### Spatial Features

For the spatial features, we have spherical coordinates (i.e. longitude and latitude):
$$
\begin{aligned}
x &= r\cos(\lambda)\cos(\phi) \\
y &= r\cos(\lambda)\sin(\phi) \\
z &= r\sin(\lambda)
\end{aligned}
$$

where $\lambda$ is the latitude, $\phi$ is the longitude, and $r$ is the radius. For $r = 1$, $x, y, z$ are bounded between $-1$ and $1$.
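This mapping can be sketched directly (angles in radians; $r = 1$ gives points on the unit sphere):

```python
import numpy as np

def spherical_to_cartesian(lat, lon, r=1.0):
    # lat (lambda) and lon (phi) in radians; returns (x, y, z) on a sphere of radius r.
    x = r * np.cos(lat) * np.cos(lon)
    y = r * np.cos(lat) * np.sin(lon)
    z = r * np.sin(lat)
    return x, y, z

lat = np.deg2rad(45.0)
lon = np.deg2rad(10.0)
x, y, z = spherical_to_cartesian(lat, lon)
print(np.round(x * x + y * y + z * z, 6))  # 1.0 — points lie on the unit sphere
```

Embedding longitude and latitude this way removes the artificial discontinuity at the dateline, since nearby points on the sphere map to nearby $(x, y, z)$.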
### Temporal Features

#### Tanh

$$
f(t) = \tanh(t)
$$

#### Fourier Features

#### Sinusoidal Positional Encoding

$$
\boldsymbol{\phi}(t) =
\begin{bmatrix}
\sin(\omega_k t) \\
\cos(\omega_k t)
\end{bmatrix}
$$

where

$$
\omega_k = \frac{1}{10000^{\frac{2k}{d}}}
$$

**Sources**:
- Transformer Architecture: The Positional Encoding - Amirhossein - Blog
- Position Information in Transformers: An Overview - Dufter et al (2021) - Arxiv - Paper
- Rethinking Positional Encoding - Zheng et al (2021) - Arxiv Paper
- Self-Attention with Functional Time Representation Learning - Xu et al (2019) - Arxiv - Paper
- AI Coffee Break with Letitia - Video 1 | 2
- Attention is all you need. A Transformer Tutorial: 5. Positional Encoding - Video

## Experiments

### Initial Conditions

*Criteria: training time, convergence.*
- Random Initialization
- Feature-Wise Interpolation
- PyInterp (2D)
- Markovian Gaussian Process (MGP)
- Optimal Interpolation (OI)

### Iterative Schemes

*Criteria: speed, accuracy, pre-training.*
- Projection-Based
- Gradient-Based
- Fixed-Point Iteration
- Anderson Acceleration
- CNN + Gradient
- LSTM

### Priors

The impact of the priors on the learning procedure.
- Deterministic
- Probabilistic

#### Deterministic

- ODE (Fixed)
- PCA (Fixed)
- ODE (Learnable)
- PCA (Learnable)
- UNet

#### Probabilistic

- UNet + DropOut
- Probabilistic UNet