Skip to content

3.1 Results - Classifying Similarity

* J. Emmanuel Johnson * 11th Nov, 2019

Recap

Recall that we are trying to investigate how a new variable called Vegetation Optical Depth (VOD) compares to some of the previously used variables Land Surface Temperature (LST), Soil Moisture (SM) and Normalized Vegetation Difference Index (NDVI). The previous variables were often used to recover droughts, but studies have shown that VOD will better characterize drought conditions. So, the objective is to see how these VOD compares to other variables via information theory and other similarity measures. In addition, the data representation could make a difference in how similar VOD is to the other variables. For example, many spatial dimension or many temporal dimensions. So, we are going to look at different representations and different combinations of VOD, LST, SM, and NDVI and do comparisons.


Data

For this first set of results, we are looking at the CONUS dataset which is located in California. We have a period of 6 years where there were drought occurences and not drought occurences.

  • Droughts - 2012, 2014, 2015)
  • No Drought - 2010, 2011, 2013

We have a spatial resolution of ... and a temporal resolution of 14 days.


Methods

For this first experiment, we only want to classify the expected uncertainty that each variable has with different temporal representations. So for example, we can do self-comparisons like what's the expected uncertainty of VOD with 1 temporal dimension versus 3 temporal dimensions? We will do this for all 4 variables individually. We will use RBIG to measure the expected uncertainty (aka Entropy)

H(\mathbf{X}) = \mathbb{E}_\mathbf{x} \left[ -\log p(\mathbf{x}) \right]

Hypothesis

Adding temporal features will definitely increase the amount of entropy within a variable. We also expect to see some differences in the entropy value obtained from the different variables. However, there should be a point where adding more temporal dimensions may not add much more information.

We see some trend that perhaps gives us intuition that there could be a 'sweet' spot for the amount of temporal dimensions to use.


Preprocessing

  1. Climatology

I remove the climatology because we want to characterize the anomalies outside of the climate patterns.

  1. Similar Data

I ensure that the lat-lon-time locations are consistent across variables.

Note: We will have less samples as we increase the number of features because of the boundaries.

Code

import sys
from pyprojroot import here
import pathlib
PATH = pathlib.Path(str(here()))
sys.path.append(str(here()))
# sys.path.insert(0, '/home/emmanuel/projects/2019_rbig_ad/src')
# sys.path.append('/home/emmanuel/code/py_esdc')
# sys.path.append('/home/emmanuel/code/rbig')

from src.visualization.taylor import TaylorDiagram
from src.features.utils import move_variables

# DataCube PreProcessing
from scipy.io import savemat, loadmat
import geopandas as geopd
from rasterio import features

# Main Libraries
import numpy as np
import scipy.io as scio
import xarray as xr
import pandas as pd
import seaborn as sns
from datetime import date
import time

# IT Algorithms
# from rbig import RBIG, RBIGMI

# ML Preprocessing
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split
from scipy import signal

# Plotting
import cartopy
import cartopy.crs as ccrs

# NUMPY SETTINGS
import numpy as onp
onp.set_printoptions(precision=3, suppress=True)

# MATPLOTLIB Settings
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# SEABORN SETTINGS
import seaborn as sns
sns.set_context(context='talk',font_scale=0.7)
# sns.set(rc={'figure.figsize': (12, 9.)})
# sns.set_style("whitegrid")

# PANDAS SETTINGS
import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

# LOGGING SETTINGS
import sys
import logging
logging.basicConfig(
    level=logging.INFO, 
    stream=sys.stdout,
    format='%(asctime)s:%(levelname)s:%(message)s'
)
logger = logging.getLogger()

# Utilities
import warnings
warnings.simplefilter('ignore', category=FutureWarning)

# Notebook Specifics
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
FIG_PATH = PATH.joinpath('reports/figures/drought/compare/taylor')
DATA_PATH = PATH.joinpath('data/drought/results/')
DATA = DATA_PATH.joinpath('exp_ind_v0.csv')
!ls $DATA_PATH
drought_v0_t12_s1_c2.csv  exp_group_v0.csv  exp_ind_v0.csv  exp_trial_v1.csv
drought_v0t1_s1_c2.csv    exp_group_v2.csv  exp_ind_v2.csv  group_trial_v1.csv

Experiment I - Individual Variables

data = pd.read_csv(DATA_PATH.joinpath('drought_v0_t12_s1_c2.csv'), index_col=[0])
data['drought'] = np.where(data['drought'] == 1, True, False)
data.tail()
cka_coeff cka_x_norm cka_xy_norm cka_y_norm drought kendall pearson rbig_H_time rbig_H_x rbig_H_y rbig_I_time rbig_I_xx rbig_I_xy rbig_Ixx_time rv_coef samples spearman temporal variable1 variable2 x_norm xy_norm y_norm year
427 0.221237 138.823248 2680.694657 87.282605 True -0.244070 -0.409710 36.325545 7.590462 17.872746 75.824331 637.188517 4.300701 88.219740 0.215439 24668.0 -0.355768 12.0 LST NDVI 45573.665268 3.994692e+08 40686.069771 2015.0
428 0.126565 138.823248 4658.927002 265.160938 True 0.068630 0.117024 35.337090 7.590462 13.565042 65.921709 637.188517 1.463533 88.408968 0.115821 24668.0 0.099671 12.0 LST SM 45573.665268 2.976909e+08 56397.987509 2015.0
429 0.063268 87.282605 715.642777 129.594257 True -0.082724 -0.099248 20.393083 17.872746 16.655743 63.903802 634.648435 2.370420 66.854350 0.061221 24668.0 -0.119624 12.0 NDVI VOD 40686.069771 1.169192e+08 46939.547837 2015.0
430 0.061859 129.594257 2125.671865 265.160938 True 0.087257 0.139381 30.667648 16.148727 13.565042 59.801988 637.978950 0.955537 97.393317 0.057369 24668.0 0.129730 12.0 VOD SM 46939.547837 1.518737e+08 56397.987509 2015.0
431 0.086073 87.282605 1992.058521 265.160938 True -0.049152 -0.126240 20.350596 17.872746 13.565042 54.584405 634.648435 2.612405 66.999012 0.078281 24668.0 -0.074243 12.0 NDVI SM 40686.069771 1.796239e+08 56397.987509 2015.0
# normalize
data['I'] = data['rbig_I_xy']
data['I_norm'] = data['rbig_I_xy'].div(data.temporal)

Taylor Diagram

sub_df.head()
cka_coeff cka_x_norm cka_xy_norm cka_y_norm drought kendall pearson rbig_H_time rbig_H_x rbig_H_y rbig_I_time rbig_I_xx rbig_I_xy rbig_Ixx_time rv_coef samples spearman temporal variable1 variable2 x_norm xy_norm y_norm year I I_norm
66 0.225857 77.761007 2364.838436 134.649918 False 0.221960 0.274675 15.629862 13.818506 7.451496 59.431192 638.710366 6.925343 45.498854 0.203241 7410.0 0.318416 12 VOD LST 31247.689450 2.054102e+08 32343.859151 2010.0 6.925343 0.682479
67 0.217807 134.649918 1945.704719 66.343726 False -0.195820 -0.240488 14.677892 7.451496 13.287757 55.166409 638.710366 2.157317 49.669100 0.202442 7410.0 -0.283381 12 LST NDVI 32343.859151 2.420993e+08 36974.382580 2010.0 2.157317 0.216804
68 0.100464 134.649918 2441.704764 180.499840 False 0.067317 0.072019 8.692752 7.451496 14.122546 28.423404 638.710366 0.937573 27.633737 0.089421 7410.0 0.099476 12 LST SM 32343.859151 1.192232e+08 41222.136572 2010.0 0.937573 0.091396
69 0.155190 77.761007 800.616047 66.343726 False -0.039435 -0.059687 6.192413 13.966970 13.287757 29.892026 638.404917 1.317379 26.868633 0.151686 7410.0 -0.061221 12 VOD NDVI 31247.689450 1.752521e+08 36974.382580 2010.0 1.317379 0.096702
70 0.046502 77.761007 652.700235 180.499840 False 0.012581 -0.019127 10.426789 13.818506 14.122546 37.671898 639.168164 0.563025 31.055566 0.041485 7410.0 0.018441 12 VOD SM 31247.689450 5.343606e+07 41222.136572 2010.0 0.563025 0.040303
def move_variables(df: pd.DataFrame, variable: str) -> pd.DataFrame:
    #     cond1 = df['variable1'] == variable
    cond = df["variable2"] == variable

    # swap other columns
    df.loc[cond, ["rbig_H_y", "rbig_H_x"]] = df.loc[
        cond, ["rbig_H_x", "rbig_H_y"]
    ].values
    df.loc[cond, ["cka_y_norm", "cka_x_norm"]] = df.loc[
        cond, ["cka_x_norm", "cka_y_norm"]
    ].values
    df.loc[cond, ["y_norm", "x_norm"]] = df.loc[
        cond, ["x_norm", "y_norm"]
    ].values

    # swap variables
    df.loc[cond, ["variable2", "variable1"]] = df.loc[
        cond, ["variable1", "variable2"]
    ].values

    return df
# conditions

drought = True
temporal = 12
reference = 'VOD'

save_name = f"{reference.lower()}_{drought}_{temporal}"
# subset dataframe
sub_df = data.copy()
if drought == 'both':
    pass
else:
    sub_df = sub_df[sub_df['drought'] == drought]

sub_df['temporal'] = sub_df['temporal'].astype(int)
sub_df = sub_df[sub_df['temporal'] == temporal]
sub_df = move_variables(sub_df, reference)

plot_taylor(sub_df, save_name)
(432, 26)
def plot_taylor(sub_df, save_name):

    # reference point
    reference_H = sub_df['rbig_H_x'].mean()

    # normalized mutual information
    sub_df['I_norm'] = sub_df['rbig_I_xy'] / np.sqrt(sub_df['rbig_H_x']) / np.sqrt(sub_df['rbig_H_y'])

    # init figure
    fig = plt.figure(figsize=(7.5, 7.5))

    taylor_fig = TaylorDiagram(
        ref_point=reference_H,
        fig=fig,
        subplot=111,
    #     corr_labels=(0.0, 0.01, 0.05, 0.075, 0.1, 0.15, 0.2),
        extend_angle=False,
        ref_range=(0, 50),
        var_label='Entropy',
        angle_label='Mutual Information'
    )

    taylor_fig.add_reference_point(
        reference_H,
        color='black',
        marker='.',
        markersize=20,
        linestyle='',
        label='VOD'
    )

    taylor_fig.add_reference_line(
        reference_H, 
        color='black',
        linestyle='--',
        label='_'
    )

    taylor_fig.add_contours(reference_H, levels=3, colors='gray')

    colors = {"NDVI": 'green', "LST": 'blue', "SM": 'red'}

    ndvi = sub_df[sub_df['variable2'] == 'NDVI'][['I_norm', 'temporal', 'rbig_H_y']]
    lst = sub_df[sub_df['variable2'] == 'LST'][['I_norm', 'temporal', 'rbig_H_y']]
    sm = sub_df[sub_df['variable2'] == 'SM'][['I_norm', 'temporal', 'rbig_H_y']]

    boundaries = (0.0, 12.0)
    norm = mpl.colors.Normalize(vmin=boundaries[0], vmax=boundaries[1])

    taylor_fig.add_scatter(
        ndvi['rbig_H_y'],
        ndvi['I_norm'],
        c='green',
        s=250,
        marker='.',
        label='NDVI',
        zorder=3,
    )

    taylor_fig.add_scatter(
        lst['rbig_H_y'],
        lst['I_norm'],
        c='blue',
        s=250,
        marker='.',
        label='LST',
        zorder=3,
    )

    taylor_fig.add_scatter(
        sm['rbig_H_y'],
        sm['I_norm'],
        c='red',
        s=250,
        marker='.',
        label='SM',
        zorder=3,
    )


    # cbar = plt.colorbar(
    #     taylor_fig.sample_points[1], fraction=0.046, extend='both', norm=norm
    # )

    # for index, idata in sub_df.iterrows():
    #     print(idata['rbig_H_y'], idata['I_norm'], idata['variable2'])
    #     taylor_fig.add_point(
    #         idata['rbig_H_y'],
    #         idata['I_norm'],
    #         marker='.',
    #         markersize=20,
    #         color=colors[idata['variable2']]
    #     )
    #     break
    # ax.set(xlabel='Entropy')
    plt.legend(loc='best')
    plt.tight_layout()

    plt.savefig(FIG_PATH.joinpath(save_name + '.png'))
# conditions
for drought in [True, False, 'both']:
    for temporal in np.arange(1, 12 + 1, dtype=int):
#         drought = True
#         temporal = 12
        reference = 'VOD'

        save_name = f"{reference.lower()}_{drought}_{temporal}"
        # subset dataframe
        sub_df = data.copy()
        if drought == 'both':
            pass
        else:
            sub_df = sub_df[sub_df['drought'] == drought]

        sub_df['temporal'] = sub_df['temporal'].astype(int)
        sub_df = sub_df[sub_df['temporal'] == temporal]
        sub_df = move_variables(sub_df, reference)

        plot_taylor(sub_df, save_name)

Entropy

def plot_mi(data, measure: str, variable: str, save=True, drought=True):

    data = data.copy()
    fig, ax = plt.subplots(figsize=(7, 5))

    if drought == 'on':
        drought = 'drought'
        data = data[data['year'].isin([2012, 2014, 2015])]
        style = None
    elif drought == 'off':
        drought = 'nondrought'
        style = None
        data = data[data['year'].isin([2010, 2011, 2013])]
    elif drought == 'both':
        drought = 'both'
        style = 'drought'
    else:
        raise ValueError('Unrecognized drought state: ', drought)

    # Select variable
    data = move_variables(data, variable)
    data = data[data['variable1'] == variable]

    sns.lineplot(
        x="temporal", y=measure, 
        hue='variable2', 
        data=data,
        style=style,
        marker='o', 
    )
    ax.set_xlabel('Temporal Dims')
    ax.set_ylabel('Entropy')
    # plt.legend(['NDVI', 'LST', 'SM', 'VOD'])
    plt.tight_layout()
    plt.show()
#     if normalized and save:
#         fig.savefig(f"{FIG_PATH}I_norm_individual_{drought}.png", frameon=False, )
#     elif save:
#         fig.savefig(f"{FIG_PATH}I_individual_{drought}.png", frameon=False, )


def move_variables(df: pd.DataFrame, variable: str)-> pd.DataFrame:
#     cond1 = df['variable1'] == variable
    cond = df['variable2'] == variable
    df.loc[
        cond, ['variable2', 'variable1']
    ] = df.loc[
        cond, ['variable1', 'variable2']
    ].values

    return df

Mutual Information

plot_mi(data, measure="I_norm", variable='VOD', drought='both')
plot_mi(data, measure="I_norm", variable='SM', drought='both')
plot_mi(data, measure="I_norm", variable='LST', drought='both')
plot_mi(data, measure="I_norm", variable='NDVI', drought='both')

RV Coefficient (Multivariate Pearson)

plot_mi(data, measure="rv_coef", variable='VOD', drought='both')
plot_mi(data, measure="rv_coef", variable='SM', drought='both')
plot_mi(data, measure="rv_coef", variable='LST', drought='both')
plot_mi(data, measure="rv_coef", variable='NDVI', drought='both')

Centered Kernel Alignment (Kernel Method)

plot_mi(data, measure="cka_coeff", variable='VOD', drought='both')
plot_mi(data, measure="cka_coeff", variable='SM', drought='both')
plot_mi(data, measure="cka_coeff", variable='LST', drought='both')
plot_mi(data, measure="cka_coeff", variable='NDVI', drought='both')

Drought Years vs Non-Drought Years