Similarity#

What is it?#

When comparing objects, the simplest question we can ask is how ‘similar’ one object is to another. It is a simple question, but it is surprisingly difficult to answer. Similarity in everyday life is easy to grasp intuitively, yet it is hard to convey as specific instructions to a computer. For example, the saying “it’s like comparing apples to oranges” is usually used to claim that two things are not comparable. But we actually can compare apples and oranges: we can compare their shape, color, composition, and even how delicious we personally find them. So let’s consider the datacube structure mentioned above. How would we compare two variables \(z_1\) and \(z_2\)?


Constraints#

  1. Invariance to Rotations

  2. Invariance to Permutations

  3. Invariance to Isotropic Scaling

  4. Invariance to Linear Transformations

  5. Invariance to Sample Size

Transformations#

Transformations affect the size and position of the data.

  • Position -> Translation, Rotation, Reflection

  • Size -> Dilation (e.g. isotropic scaling)

Translation: slides across a plane or through space (some distance, some direction)

  • Transformation that slides across a plane or through a space

  • All points of a figure/dataset move the same distance and in the same direction

  • “change of location”, specified by direction and distance

Rotation: turns a figure around a point or line

  • a spin of the shape

  • turns the figure about a fixed point or axis

  • related to rotational symmetry

Reflection: transformation that flips a figure across a line

  • related to reflection (mirror) symmetry

Dilation: changes the size but not the actual shape

  • e.g. enlarge, reduce

  • usually given as a scale factor

Putting these together, a common way to write a similarity transformation of a dataset \(\mathbf{X}\) combines an isotropic scaling \(\alpha\), an orthogonal rotation \(\mathbf{R}\), and a translation \(b\):

\[ \alpha \mathbf{X} \mathbf{R} + b \]
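As a concrete illustration, here is a minimal NumPy sketch (variable names are mine, chosen for illustration) that applies such a similarity transformation to a toy dataset: an isotropic scaling \(\alpha\), an orthogonal rotation \(\mathbf{R}\), and a translation \(b\).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # toy dataset: 100 samples, 2 features

# isotropic scaling (dilation): a single scale factor for every direction
alpha = 2.5

# rotation: a 2x2 orthogonal matrix (here a 30-degree rotation)
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# translation: shift every sample by the same vector
b = np.array([1.0, -3.0])

# combined similarity transformation: alpha * X @ R + b
X_transformed = alpha * X @ R + b

# sanity check: R is orthogonal, so shape is preserved up to the scale alpha
print(np.allclose(R @ R.T, np.eye(2)))
```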

Orthogonal Transformations#

Linear Transformations#

Isotropic Scaling#

Multivariate#

Curse of Dimensionality#


Classic Methods#

These are some notes based on an excellent review paper [].

  • Evaluate the similarity of the whole configuration of units.

  • The general recipe: a) transform the data into an \(N\times N\) square matrix, b) evaluate the similarity between these matrices.

Congruence Coefficient#

  • Unadjusted Correlation - []

  • Congruence - []

  • Monotonicity Coefficient - []

These are sensitive to patterns of similarity but not to rotations or dilations.

RV Coefficient#

An early instance of a natural generalization of the notion of correlation to groups of variables. It underlies, and can be generalized to, almost every other method discussed here.

  • []

    Measures the similarity between square positive semi-definite (PSD) matrices; used across multivariate analysis techniques.

  • Abdi (2003), Abdi & Valentin (2007), Abdi (2007, 2009), Holmes (1989, 2007)

    STATIS; DISTATIS: transform rectangular matrices into square matrices (proof in Abdi 2007a).

Single Sample, Multiple Features#

\(\mathbf{x,y}\in \mathcal{R}^D\).

\[ \rho V(\mathbf{x,y}) = \frac{\text{Tr}\left( \boldsymbol{\Sigma}_{\mathbf{xy}} \boldsymbol{\Sigma}_{\mathbf{yx}}\right)}{\sqrt{\text{Tr}\left( \boldsymbol{\Sigma}_{\mathbf{xx}}^2\right)\text{Tr}\left( \boldsymbol{\Sigma}_{\mathbf{yy}}^2\right)}} \]
  • Relation to Pearson Correlation Coefficient: \(D=1, \rho V=\rho^2\)

  • \(\rho V \in [0,1]\)

  • \(\rho V (\mathbf{x}, a\mathbf{xR}+c)=1\), i.e. invariance to isotropic scaling, orthogonal transformation, and translation

Connections

  • []; Connections between RV and PCA, LDA, LR, etc.

Multiple Samples, Multiple Features#

Considers two sets of variables similar if the relative positions of the observations in one set are similar to the relative positions of the observations in the other.

\(\mathbf{X,Y}\in \mathcal{R}^{N \times D}\). Let’s create some matrices describing the empirical covariance, e.g. \(\boldsymbol{\Sigma}_\mathbf{XY} = \frac{1}{N-1}\mathbf{X^\top Y}\in\mathcal{R}^{D \times D}\). We can summarize this as follows:

\[ \rho_\text{RV}(\mathbf{X,Y}) = \frac{\text{Tr}(\boldsymbol{\Sigma}_\mathbf{XY}\boldsymbol{\Sigma}_\mathbf{YX})}{\sqrt{\text{Tr}\left(\boldsymbol{\Sigma}_\mathbf{XX}^2\right) \text{Tr}\left(\boldsymbol{\Sigma}_\mathbf{YY}^2\right)}} \]
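A minimal NumPy sketch of this formula (the function name and toy data are mine; the data are column-centered inside the function so the empirical covariances are valid):

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two datasets sharing the same N samples (rows)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                  # center each feature
    Yc = Y - Y.mean(axis=0)

    # empirical (cross-)covariance matrices
    Sxy = Xc.T @ Yc / (n - 1)
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)

    num = np.trace(Sxy @ Sxy.T)              # Tr(S_xy S_yx)
    den = np.sqrt(np.trace(Sxx @ Sxx) * np.trace(Syy @ Syy))
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ rng.normal(size=(3, 4)) + 0.1 * rng.normal(size=(200, 4))
print(rv_coefficient(X, Y))                          # high for a linear relation
print(rv_coefficient(X, rng.normal(size=(200, 4))))  # near 0 for independent data
```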

Sample Space#

Consider matrices in the sample space: \(\mathbf{K_x}=\mathbf{XX^\top}\in\mathcal{R}^{N \times N}\). By taking the Hilbert-Schmidt (Frobenius) inner product, we can measure their proximity:

\[ \langle \mathbf{K_X, K_Y}\rangle_F = \text{Tr}(\mathbf{K_X K_Y}) = \sum_i^{D_\mathbf{x}}\sum_j^{D_\mathbf{y}} \text{Cov}^2(\mathbf{X_{.i}},\mathbf{Y_{.j}}) \]
\[\begin{split} \begin{aligned} \rho_\text{RV}(\mathbf{X,Y}) &= \frac{\langle \mathbf{K_X}, \mathbf{K_Y}\rangle_F}{||\mathbf{K_X}||_F ||\mathbf{K_Y}||_F } \\ &= \frac{\text{Tr}(\mathbf{K_X}\mathbf{K_Y})}{\sqrt{\text{Tr}\left(\mathbf{K_X}^2\right) \text{Tr}\left(\mathbf{K_Y}^2\right)}} \end{aligned} \end{split}\]

Relation#

Comparing features is the same as comparing samples. [] (slides)

(30)#\[ ||\mathbf{X}^\top \mathbf{Y}||_F^2 = \text{Tr}\left(\mathbf{XX^\top YY^\top} \right) \]
  • The LHS is the sum of squared dot product similarities between the features

  • The RHS is the dot product between reshaped inter-example similarity matrices

Proof

The proof is simple: we expand the trace term using the cyclic property of the trace.

\[\begin{split} \begin{aligned} \text{Tr}\left(\mathbf{XX^\top YY^\top} \right) &= \text{Tr}\left(\mathbf{X^\top YY^\top X} \right)\\ &= \text{Tr}\left(\left(\mathbf{X^\top Y}\right)\left(\mathbf{X^\top Y}\right)^\top \right) \\ &= ||\mathbf{X}^\top \mathbf{Y}||_F^2 \\ \end{aligned} \end{split}\]

As with the \(\rho_\text{RV}\) coefficient, we can normalize both sides of the equation to get equivalent bounded correlations:

\[ \frac{||\mathbf{X}^\top \mathbf{Y}||_F^2}{||\mathbf{X}^\top \mathbf{X}||_F||\mathbf{Y}^\top \mathbf{Y}||_F} = \frac{\text{Tr}\left(\mathbf{XX^\top YY^\top} \right)}{||\mathbf{X} \mathbf{X}^\top||_F||\mathbf{Y} \mathbf{Y}^\top||_F} \]

as shown in []
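A quick numerical sanity check of this equivalence on random toy data (a hedged sketch, not taken from the cited work):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
Y = rng.normal(size=(50, 5))

# feature space: squared Frobenius norm of the D_x x D_y cross-product matrix
lhs = np.linalg.norm(X.T @ Y, "fro") ** 2

# sample space: dot product between the N x N Gram matrices
rhs = np.trace(X @ X.T @ Y @ Y.T)
print(np.isclose(lhs, rhs))      # True: comparing features == comparing samples

# the normalized (bounded) versions agree as well
feat = lhs / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
samp = rhs / (np.linalg.norm(X @ X.T, "fro") * np.linalg.norm(Y @ Y.T, "fro"))
print(np.isclose(feat, samp))    # True
```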


Distance Correlation#

  • Originally proposed by: [].

A squared-distance matrix can be converted into a centered inner-product (kernel-like) matrix via double centering:

\[ K_{ij} = - \frac{1}{2}(d_{ij}^2 - d_{i.}^2 - d_{.j}^2 + d_{..}^2) \]

where a dot in the subscript denotes an average over that index (row mean, column mean, and grand mean, respectively).

\[ \rho_\text{RV} = \frac{\langle H\triangle_x^2H, H\triangle_y^2H \rangle_F}{|| H\triangle_x^2H||_F ||H\triangle_y^2H ||_F} \]

where:

(27)#\[ \mathbf{H}=\mathbf{I}_N - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top \]
  1. The features are standardized to have unit variance. So we have to be careful with the preprocessing.

  2. It’s a unifying tool that maximizes the association coefficients under certain constraints.

Assuming Euclidean distances, we obtain the distance correlation (a minimal sketch follows the list below):

\[ \text{dCor}(\mathbf{X,Y}) = \frac{\langle H\triangle_xH, H\triangle_yH \rangle_F}{|| H\triangle_xH||_F ||H\triangle_yH ||_F} \]
  • We use the distance matrices \(\triangle\) instead of the squared distance matrices \(\triangle^2\)

  • \(\text{dCorr}\) can detect non-linear relationships, whereas RV can only detect linear relationships

  • If we don’t square the distances, we can pick up more complex relationships

  • Statistical consistency: \(n \rightarrow \infty\)

  • \(0 \leq \text{dCor}(\mathbf{X,Y}) \leq 1\)

  • \(\text{dCor}(\mathbf{X,Y})=0\) iff \(\mathbf{X}\) and \(\mathbf{Y}\) are independent

  • \(\text{dCor}(\mathbf{X}, a \mathbf{XB} + c)=1\)
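Below is a minimal sketch of the centered-distance formula above, written directly from the definition (helper and variable names are mine):

```python
import numpy as np

def dcor(X, Y):
    """Distance correlation via double-centered Euclidean distance matrices."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n              # centering matrix, eq. (27)

    # pairwise Euclidean distance matrices (N x N), *not* squared
    Dx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    Dy = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)

    A, B = H @ Dx @ H, H @ Dy @ H                    # double-centered distances
    num = np.sum(A * B)                              # Frobenius inner product
    den = np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro")
    return num / den if den > 0 else 0.0

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1))
print(dcor(X, X**2))                          # nonlinear relation: clearly above 0
print(dcor(X, rng.normal(size=(300, 1))))     # independent: close to 0
```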

Connections

  • PCA []

  • Discriminant Analysis, []

  • Canonical Analysis, []

  • Multivariate Regression []


Kernel Methods#

  • MMD -> the distance (in an RKHS) between the joint distribution of two random variables and the product of their marginal distributions; used this way, it measures dependence

  • HSIC

(28)#\[ \rho_\text{HSIC}(\mathbf{X,Y}) = \frac{1}{(N-1)^2}\text{Tr}(\tilde{\mathbf{K}}_\mathbf{x}\tilde{\mathbf{K}}_\mathbf{y}) \]

where \(\tilde{\mathbf{K}}=\mathbf{HKH}\) and \(\mathbf{H}\) is the same centering matrix as in equation (27).
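A minimal sketch of this estimator with an RBF kernel and a median-heuristic bandwidth (the kernel choice and helper names are assumptions, not prescribed by the references):

```python
import numpy as np

def rbf_kernel(X, sigma=None):
    """RBF Gram matrix; sigma defaults to the median pairwise distance."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    if sigma is None:
        sigma = np.sqrt(np.median(d2[d2 > 0]))
    return np.exp(-d2 / (2 * sigma**2))

def hsic(X, Y):
    """Biased HSIC estimate: Tr(K~x K~y) / (N - 1)^2 with centered kernels."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kx = H @ rbf_kernel(X) @ H
    Ky = H @ rbf_kernel(Y) @ H
    return np.trace(Kx @ Ky) / (n - 1) ** 2

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
print(hsic(X, np.sin(X)))                    # dependent: noticeably larger
print(hsic(X, rng.normal(size=(200, 2))))    # independent: close to 0
```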

Connections:

  • []: some talks, Slides | Slides

  • []; multivariate kernel methods in the analysis of graphs

  • []; Kernel Target Alignment -> a normalized version of HSIC (without centering)

  • []; Centered Kernel Target Alignment -> a normalized version of HSIC (with centering)

  • []; Energy statistics; connections between distance correlations and kernel methods.


Normalized Variants#

Just as with covariance versus correlation, HSIC is difficult to interpret because it is unbounded and inconsistent with respect to the number of samples and dimensions: for different sample sizes and dimensionalities you can get vastly different HSIC values even when the underlying dependence is the same. As such, normalized variants such as [] were introduced to alleviate this issue. Just like correlation, this works by normalizing the HSIC value by the corresponding norms of each centered kernel matrix.

(29)#\[ \rho_\text{nHSIC}(\mathbf{X,Y}) = \frac{\text{Tr}(\tilde{\mathbf{K}}_\mathbf{x}\tilde{\mathbf{K}}_\mathbf{y})}{||\tilde{\mathbf{K}}_\mathbf{x}||_F\,||\tilde{\mathbf{K}}_\mathbf{y}||_F} \]

This particular variant is called Centered Kernel Alignment [], but throughout this thesis we will simply call it normalized HSIC to simplify the terminology. For more details regarding HSIC, see the reference in the appendix.
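A corresponding sketch of the normalized variant, operating on precomputed kernel matrices (here linear kernels, to also illustrate the invariance to isotropic scaling; the function name is mine):

```python
import numpy as np

def nhsic(Kx, Ky):
    """Normalized HSIC (CKA) from two N x N kernel matrices."""
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kx, Ky = H @ Kx @ H, H @ Ky @ H                  # center both kernels
    return np.trace(Kx @ Ky) / (np.linalg.norm(Kx, "fro") * np.linalg.norm(Ky, "fro"))

rng = np.random.default_rng(3)
X, Y = rng.normal(size=(100, 3)), rng.normal(size=(100, 5))
Kx, Ky = X @ X.T, Y @ Y.T                            # linear kernels
print(np.isclose(nhsic(Kx, Ky), nhsic(10.0 * Kx, Ky)))   # True: scale-invariant
```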


Randomized Kernels#

Randomized kernel approximations allow us to estimate a kernel matrix \(\mathbf{K}\) in less time by means of an explicit low-dimensional feature map \(\mathbf{Z}\):

\[ \mathbf{K} \approx \hat{\mathbf{K}}= \mathbf{Z}\mathbf{Z}^\top \]

The most important take-away is that, using the relationship between the dot product of kernel matrices in the sample space and the dot product of datasets in the feature space seen in equation (30), we can estimate the Frobenius norm much more easily, especially for very large-scale datasets.

(30)#\[\begin{split} \begin{aligned} \text{Tr}\left(\tilde{\mathbf{K}}_\mathbf{x}\tilde{\mathbf{K}}_\mathbf{y} \right) &\approx \text{Tr}\left(\hat{\mathbf{K}}_\mathbf{x}\hat{\mathbf{K}}_\mathbf{y} \right)\\ &= \text{Tr}\left(\mathbf{Z}_\mathbf{x}\mathbf{Z}^\top_\mathbf{x}\mathbf{Z}_\mathbf{y}\mathbf{Z}^\top_\mathbf{y} \right)\\ &= ||\mathbf{Z}_\mathbf{x}^\top \mathbf{Z}_\mathbf{y}||_F^2 \end{aligned} \end{split}\]
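A hedged sketch of this idea with random Fourier features for an RBF kernel (the feature count, bandwidth, and helper names are assumptions; Nyström features could be used the same way):

```python
import numpy as np

def rff(X, n_features=300, sigma=1.0, seed=0):
    """Random Fourier features Z such that Z Z^T approximates an RBF kernel."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(4)
X, Y = rng.normal(size=(500, 3)), rng.normal(size=(500, 2))
Zx, Zy = rff(X, seed=1), rff(Y, seed=2)

# centering the features centers the approximate kernel: H Z Z^T H = (Z - mean)(Z - mean)^T
Zx = Zx - Zx.mean(axis=0)
Zy = Zy - Zy.mean(axis=0)

# approximates Tr(K~x K~y) without ever forming an N x N kernel matrix
approx = np.linalg.norm(Zx.T @ Zy, "fro") ** 2
print(approx)
```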

In [], the authors ran an experiment to see how well approximation methods such as random Fourier features (RFF) and Nyström performed versus their exact counterparts in independence testing. They found that for large-scale synthetic data these methods performed rather well, and they encouraged their use in future applications.


Mutual Information#

Information Theory#

  • Mutual information is the information-theoretic counterpart to the correlation-based measures above.

  • It requires an estimation step which may introduce additional uncertainties

  • Extends nicely to different types of data (e.g. discrete, categorical, multivariate, multidimensional)

  • Exposes non-linearities which may be difficult to see via (linear) correlations

  • Kernel Approximations: Although there are some differences for different estimators, relative distances are consistent

A Primer#

  • Entropy - measure of information uncertainty of \(X\)

  • Joint Entropy - uncertainty of \(X,Y\)

  • Conditional Entropy - uncertainty of \(X\) given that I know \(Y\)

  • Mutual Information - how much knowing \(X\) reduces the uncertainty of \(Y\)

    • \(I(X,Y)= H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y)\)

  • Normalized Mutual Information (a minimal sketch of both quantities follows this list)

    • \(\tilde{I}(X,Y) = \frac{I(X,Y)}{\sqrt{H(X)H(Y)}}\)
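Below is a minimal sketch of these quantities for two one-dimensional variables, estimated from a joint histogram (the binning scheme and function names are mine; other estimators exist):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability array, ignoring zero cells."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def normalized_mi(x, y, bins=10):
    """Histogram estimate of I(X,Y) / sqrt(H(X) H(Y))."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    mi = entropy(px) + entropy(py) - entropy(pxy.ravel())
    return mi / np.sqrt(entropy(px) * entropy(py))

rng = np.random.default_rng(5)
x = rng.normal(size=5000)
print(normalized_mi(x, x**2))                   # nonlinear dependence: well above 0
print(normalized_mi(x, rng.normal(size=5000)))  # independent: close to 0
```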

Variation of Information#

A distance measure between random variables in information-theoretic terms (a small sketch follows at the end of this subsection).

\[\begin{split} \begin{aligned} VI(X,Y) &= H(X|Y) + H(Y|X)\\ &= H(X) + H(Y) -2I(X,Y) \end{aligned} \end{split}\]

where:

  • \(VI(X,Y)=0\) iff \(X\) and \(Y\) are the same

    • \(H(X,Y)=H(X)=H(Y)=I(X,Y)\)

  • \(VI(X,Y) < H(X,Y)\) if \(X\) and \(Y\) are different but dependent

    • \(H(X,Y)<H(X) + H(Y)\)

  • \(VI(X,Y)=H(X,Y)\) if \(X\) and \(Y\) are independent

    • \(H(X,Y)=H(X) + H(Y)\)

    • \(I(X,Y)=0\)

  • Variation of Information: []

  • Normalized Variants: []
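A small companion sketch for the variation of information, using the same histogram-based entropies as above (estimator choices are mine):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def variation_of_information(x, y, bins=10):
    """VI(X,Y) = H(X) + H(Y) - 2 I(X,Y) = 2 H(X,Y) - H(X) - H(Y)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    hx, hy = entropy(pxy.sum(axis=1)), entropy(pxy.sum(axis=0))
    hxy = entropy(pxy.ravel())
    return 2 * hxy - hx - hy

rng = np.random.default_rng(6)
x = rng.normal(size=5000)
print(variation_of_information(x, x))                       # ~0: identical variables
print(variation_of_information(x, rng.normal(size=5000)))   # ~H(X,Y): independent
```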


Summary of Methods#

| Name | Non-Linear | Multi-Dimensional | Isotropic Scaling | Orthogonal Transforms | Coefficient | Computational Cost |
| --- | --- | --- | --- | --- | --- | --- |
| Pearson, \(\rho\) | ✗ | ✗ | ✓ | ✓ | ✓ | \(n\) |
| Spearman, \(\rho\) | ✗ | ✗ | ✓ | ✓ | ✓ | \(n\log n\) |
| RV Coeff, \(\rho_\text{RV}\) | ✗ | ✓ | ✓ | ✓ | ✓ | \(n^2\) |
| dCorr | ✓ | ✓ | ✓ | ✓ | ✓ | \(n^2\) |
| HSIC | ✓ | ✓ | ✗ | ✓ | ✗ | \(n^2\) |
| nHSIC | ✓ | ✓ | ✓ | ✓ | ✓ | \(n^2\) |
| MI | ✓ | ✓ | ✓ | ✓ | ✗ | - |
| nMI | ✓ | ✓ | ✓ | ✓ | ✗ | - |


Questions#

  1. Are there correlations across seasons or latitudes?

  2. Are there large discrepancies in the different outputs?

Classes of Methods#


Resources#

Websites#

Papers#

  • [1] “The Mutual Information Diagram for Uncertainty Visualization” - Correa & Lindstrom (2012)

  • [2] “Summarizing multiple aspects of model performance in a single diagram”

