3.1 Results - Classifying Similarity¶

* J. Emmanuel Johnson * 11^th Nov, 2019

Recap¶

Recall that we are trying to investigate how a new variable called Vegetation Optical Depth (VOD) compares to some of the previously used variables Land Surface Temperature (LST), Soil Moisture (SM) and Normalized Vegetation Difference Index (NDVI). The previous variables were often used to recover droughts, but studies have shown that VOD will better characterize drought conditions. So, the objective is to see how these VOD compares to other variables via information theory and other similarity measures. In addition, the data representation could make a difference in how similar VOD is to the other variables. For example, many spatial dimension or many temporal dimensions. So, we are going to look at different representations and different combinations of VOD, LST, SM, and NDVI and do comparisons.

Data¶

For this first set of results, we are looking at the CONUS dataset which is located in California. We have a period of 6 years where there were drought occurences and not drought occurences.

Droughts - 2012, 2014, 2015)
No Drought - 2010, 2011, 2013

We have a spatial resolution of ... and a temporal resolution of 14 days.

Methods¶

For this first experiment, we only want to classify the expected uncertainty that each variable has with different temporal representations. So for example, we can do self-comparisons like what's the expected uncertainty of VOD with 1 temporal dimension versus 3 temporal dimensions? We will do this for all 4 variables individually. We will use RBIG to measure the expected uncertainty (aka Entropy)

$H(\mathbf{X}) = \mathbb{E}_\mathbf{x} \left[ -\log p(\mathbf{x}) \right]$

Hypothesis¶

Adding temporal features will definitely increase the amount of entropy within a variable. We also expect to see some differences in the entropy value obtained from the different variables. However, there should be a point where adding more temporal dimensions may not add much more information.

We see some trend that perhaps gives us intuition that there could be a 'sweet' spot for the amount of temporal dimensions to use.

Preprocessing¶

Climatology

I remove the climatology because we want to characterize the anomalies outside of the climate patterns.

Similar Data

I ensure that the lat-lon-time locations are consistent across variables.

Note: We will have less samples as we increase the number of features because of the boundaries.

Code¶

import sys
from pyprojroot import here
import pathlib
PATH = pathlib.Path(str(here()))
sys.path.append(str(here()))
# sys.path.insert(0, '/home/emmanuel/projects/2019_rbig_ad/src')
# sys.path.append('/home/emmanuel/code/py_esdc')
# sys.path.append('/home/emmanuel/code/rbig')

from src.visualization.taylor import TaylorDiagram
from src.features.utils import move_variables

# DataCube PreProcessing
from scipy.io import savemat, loadmat
import geopandas as geopd
from rasterio import features

# Main Libraries
import numpy as np
import scipy.io as scio
import xarray as xr
import pandas as pd
import seaborn as sns
from datetime import date
import time

# IT Algorithms
# from rbig import RBIG, RBIGMI

# ML Preprocessing
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split
from scipy import signal

# Plotting
import cartopy
import cartopy.crs as ccrs

# NUMPY SETTINGS
import numpy as onp
onp.set_printoptions(precision=3, suppress=True)

# MATPLOTLIB Settings
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# SEABORN SETTINGS
import seaborn as sns
sns.set_context(context='talk',font_scale=0.7)
# sns.set(rc={'figure.figsize': (12, 9.)})
# sns.set_style("whitegrid")

# PANDAS SETTINGS
import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

# LOGGING SETTINGS
import sys
import logging
logging.basicConfig(
    level=logging.INFO, 
    stream=sys.stdout,
    format='%(asctime)s:%(levelname)s:%(message)s'
)
logger = logging.getLogger()

# Utilities
import warnings
warnings.simplefilter('ignore', category=FutureWarning)

# Notebook Specifics
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

FIG_PATH = PATH.joinpath('reports/figures/drought/compare/taylor')
DATA_PATH = PATH.joinpath('data/drought/results/')
DATA = DATA_PATH.joinpath('exp_ind_v0.csv')

!ls $DATA_PATH

drought_v0_t12_s1_c2.csv  exp_group_v0.csv  exp_ind_v0.csv  exp_trial_v1.csv
drought_v0t1_s1_c2.csv    exp_group_v2.csv  exp_ind_v2.csv  group_trial_v1.csv

Experiment I - Individual Variables¶

data = pd.read_csv(DATA_PATH.joinpath('drought_v0_t12_s1_c2.csv'), index_col=[0])
data['drought'] = np.where(data['drought'] == 1, True, False)
data.tail()

	cka_coeff	cka_x_norm	cka_xy_norm	cka_y_norm	drought	kendall	pearson	rbig_H_time	rbig_H_x	rbig_H_y	rbig_I_time	rbig_I_xx	rbig_I_xy	rbig_Ixx_time	rv_coef	samples	spearman	temporal	variable1	variable2	x_norm	xy_norm	y_norm	year
427	0.221237	138.823248	2680.694657	87.282605	True	-0.244070	-0.409710	36.325545	7.590462	17.872746	75.824331	637.188517	4.300701	88.219740	0.215439	24668.0	-0.355768	12.0	LST	NDVI	45573.665268	3.994692e+08	40686.069771	2015.0
428	0.126565	138.823248	4658.927002	265.160938	True	0.068630	0.117024	35.337090	7.590462	13.565042	65.921709	637.188517	1.463533	88.408968	0.115821	24668.0	0.099671	12.0	LST	SM	45573.665268	2.976909e+08	56397.987509	2015.0
429	0.063268	87.282605	715.642777	129.594257	True	-0.082724	-0.099248	20.393083	17.872746	16.655743	63.903802	634.648435	2.370420	66.854350	0.061221	24668.0	-0.119624	12.0	NDVI	VOD	40686.069771	1.169192e+08	46939.547837	2015.0
430	0.061859	129.594257	2125.671865	265.160938	True	0.087257	0.139381	30.667648	16.148727	13.565042	59.801988	637.978950	0.955537	97.393317	0.057369	24668.0	0.129730	12.0	VOD	SM	46939.547837	1.518737e+08	56397.987509	2015.0
431	0.086073	87.282605	1992.058521	265.160938	True	-0.049152	-0.126240	20.350596	17.872746	13.565042	54.584405	634.648435	2.612405	66.999012	0.078281	24668.0	-0.074243	12.0	NDVI	SM	40686.069771	1.796239e+08	56397.987509	2015.0

# normalize
data['I'] = data['rbig_I_xy']
data['I_norm'] = data['rbig_I_xy'].div(data.temporal)

Taylor Diagram¶

sub_df.head()

	cka_coeff	cka_x_norm	cka_xy_norm	cka_y_norm	drought	kendall	pearson	rbig_H_time	rbig_H_x	rbig_H_y	rbig_I_time	rbig_I_xx	rbig_I_xy	rbig_Ixx_time	rv_coef	samples	spearman	temporal	variable1	variable2	x_norm	xy_norm	y_norm	year	I	I_norm
66	0.225857	77.761007	2364.838436	134.649918	False	0.221960	0.274675	15.629862	13.818506	7.451496	59.431192	638.710366	6.925343	45.498854	0.203241	7410.0	0.318416	12	VOD	LST	31247.689450	2.054102e+08	32343.859151	2010.0	6.925343	0.682479
67	0.217807	134.649918	1945.704719	66.343726	False	-0.195820	-0.240488	14.677892	7.451496	13.287757	55.166409	638.710366	2.157317	49.669100	0.202442	7410.0	-0.283381	12	LST	NDVI	32343.859151	2.420993e+08	36974.382580	2010.0	2.157317	0.216804
68	0.100464	134.649918	2441.704764	180.499840	False	0.067317	0.072019	8.692752	7.451496	14.122546	28.423404	638.710366	0.937573	27.633737	0.089421	7410.0	0.099476	12	LST	SM	32343.859151	1.192232e+08	41222.136572	2010.0	0.937573	0.091396
69	0.155190	77.761007	800.616047	66.343726	False	-0.039435	-0.059687	6.192413	13.966970	13.287757	29.892026	638.404917	1.317379	26.868633	0.151686	7410.0	-0.061221	12	VOD	NDVI	31247.689450	1.752521e+08	36974.382580	2010.0	1.317379	0.096702
70	0.046502	77.761007	652.700235	180.499840	False	0.012581	-0.019127	10.426789	13.818506	14.122546	37.671898	639.168164	0.563025	31.055566	0.041485	7410.0	0.018441	12	VOD	SM	31247.689450	5.343606e+07	41222.136572	2010.0	0.563025	0.040303

def move_variables(df: pd.DataFrame, variable: str) -> pd.DataFrame:
    #     cond1 = df['variable1'] == variable
    cond = df["variable2"] == variable

    # swap other columns
    df.loc[cond, ["rbig_H_y", "rbig_H_x"]] = df.loc[
        cond, ["rbig_H_x", "rbig_H_y"]
    ].values
    df.loc[cond, ["cka_y_norm", "cka_x_norm"]] = df.loc[
        cond, ["cka_x_norm", "cka_y_norm"]
    ].values
    df.loc[cond, ["y_norm", "x_norm"]] = df.loc[
        cond, ["x_norm", "y_norm"]
    ].values

    # swap variables
    df.loc[cond, ["variable2", "variable1"]] = df.loc[
        cond, ["variable1", "variable2"]
    ].values

    return df

# conditions

drought = True
temporal = 12
reference = 'VOD'

save_name = f"{reference.lower()}_{drought}_{temporal}"
# subset dataframe
sub_df = data.copy()
if drought == 'both':
    pass
else:
    sub_df = sub_df[sub_df['drought'] == drought]

sub_df['temporal'] = sub_df['temporal'].astype(int)
sub_df = sub_df[sub_df['temporal'] == temporal]
sub_df = move_variables(sub_df, reference)

plot_taylor(sub_df, save_name)

(432, 26)

def plot_taylor(sub_df, save_name):

    # reference point
    reference_H = sub_df['rbig_H_x'].mean()

    # normalized mutual information
    sub_df['I_norm'] = sub_df['rbig_I_xy'] / np.sqrt(sub_df['rbig_H_x']) / np.sqrt(sub_df['rbig_H_y'])

    # init figure
    fig = plt.figure(figsize=(7.5, 7.5))

    taylor_fig = TaylorDiagram(
        ref_point=reference_H,
        fig=fig,
        subplot=111,
    #     corr_labels=(0.0, 0.01, 0.05, 0.075, 0.1, 0.15, 0.2),
        extend_angle=False,
        ref_range=(0, 50),
        var_label='Entropy',
        angle_label='Mutual Information'
    )

    taylor_fig.add_reference_point(
        reference_H,
        color='black',
        marker='.',
        markersize=20,
        linestyle='',
        label='VOD'
    )

    taylor_fig.add_reference_line(
        reference_H, 
        color='black',
        linestyle='--',
        label='_'
    )

    taylor_fig.add_contours(reference_H, levels=3, colors='gray')

    colors = {"NDVI": 'green', "LST": 'blue', "SM": 'red'}

    ndvi = sub_df[sub_df['variable2'] == 'NDVI'][['I_norm', 'temporal', 'rbig_H_y']]
    lst = sub_df[sub_df['variable2'] == 'LST'][['I_norm', 'temporal', 'rbig_H_y']]
    sm = sub_df[sub_df['variable2'] == 'SM'][['I_norm', 'temporal', 'rbig_H_y']]

    boundaries = (0.0, 12.0)
    norm = mpl.colors.Normalize(vmin=boundaries[0], vmax=boundaries[1])

    taylor_fig.add_scatter(
        ndvi['rbig_H_y'],
        ndvi['I_norm'],
        c='green',
        s=250,
        marker='.',
        label='NDVI',
        zorder=3,
    )

    taylor_fig.add_scatter(
        lst['rbig_H_y'],
        lst['I_norm'],
        c='blue',
        s=250,
        marker='.',
        label='LST',
        zorder=3,
    )

    taylor_fig.add_scatter(
        sm['rbig_H_y'],
        sm['I_norm'],
        c='red',
        s=250,
        marker='.',
        label='SM',
        zorder=3,
    )


    # cbar = plt.colorbar(
    #     taylor_fig.sample_points[1], fraction=0.046, extend='both', norm=norm
    # )

    # for index, idata in sub_df.iterrows():
    #     print(idata['rbig_H_y'], idata['I_norm'], idata['variable2'])
    #     taylor_fig.add_point(
    #         idata['rbig_H_y'],
    #         idata['I_norm'],
    #         marker='.',
    #         markersize=20,
    #         color=colors[idata['variable2']]
    #     )
    #     break
    # ax.set(xlabel='Entropy')
    plt.legend(loc='best')
    plt.tight_layout()

    plt.savefig(FIG_PATH.joinpath(save_name + '.png'))

# conditions
for drought in [True, False, 'both']:
    for temporal in np.arange(1, 12 + 1, dtype=int):
#         drought = True
#         temporal = 12
        reference = 'VOD'

        save_name = f"{reference.lower()}_{drought}_{temporal}"
        # subset dataframe
        sub_df = data.copy()
        if drought == 'both':
            pass
        else:
            sub_df = sub_df[sub_df['drought'] == drought]

        sub_df['temporal'] = sub_df['temporal'].astype(int)
        sub_df = sub_df[sub_df['temporal'] == temporal]
        sub_df = move_variables(sub_df, reference)

        plot_taylor(sub_df, save_name)

Entropy¶

def plot_mi(data, measure: str, variable: str, save=True, drought=True):

    data = data.copy()
    fig, ax = plt.subplots(figsize=(7, 5))

    if drought == 'on':
        drought = 'drought'
        data = data[data['year'].isin([2012, 2014, 2015])]
        style = None
    elif drought == 'off':
        drought = 'nondrought'
        style = None
        data = data[data['year'].isin([2010, 2011, 2013])]
    elif drought == 'both':
        drought = 'both'
        style = 'drought'
    else:
        raise ValueError('Unrecognized drought state: ', drought)

    # Select variable
    data = move_variables(data, variable)
    data = data[data['variable1'] == variable]

    sns.lineplot(
        x="temporal", y=measure, 
        hue='variable2', 
        data=data,
        style=style,
        marker='o', 
    )
    ax.set_xlabel('Temporal Dims')
    ax.set_ylabel('Entropy')
    # plt.legend(['NDVI', 'LST', 'SM', 'VOD'])
    plt.tight_layout()
    plt.show()
#     if normalized and save:
#         fig.savefig(f"{FIG_PATH}I_norm_individual_{drought}.png", frameon=False, )
#     elif save:
#         fig.savefig(f"{FIG_PATH}I_individual_{drought}.png", frameon=False, )


def move_variables(df: pd.DataFrame, variable: str)-> pd.DataFrame:
#     cond1 = df['variable1'] == variable
    cond = df['variable2'] == variable
    df.loc[
        cond, ['variable2', 'variable1']
    ] = df.loc[
        cond, ['variable1', 'variable2']
    ].values

    return df

Mutual Information¶

plot_mi(data, measure="I_norm", variable='VOD', drought='both')
plot_mi(data, measure="I_norm", variable='SM', drought='both')
plot_mi(data, measure="I_norm", variable='LST', drought='both')
plot_mi(data, measure="I_norm", variable='NDVI', drought='both')

RV Coefficient (Multivariate Pearson)¶

plot_mi(data, measure="rv_coef", variable='VOD', drought='both')
plot_mi(data, measure="rv_coef", variable='SM', drought='both')
plot_mi(data, measure="rv_coef", variable='LST', drought='both')
plot_mi(data, measure="rv_coef", variable='NDVI', drought='both')

Centered Kernel Alignment (Kernel Method)¶

plot_mi(data, measure="cka_coeff", variable='VOD', drought='both')
plot_mi(data, measure="cka_coeff", variable='SM', drought='both')
plot_mi(data, measure="cka_coeff", variable='LST', drought='both')
plot_mi(data, measure="cka_coeff", variable='NDVI', drought='both')