
3.0 Results - Classifying Uncertainty

Recap

Recall that we are investigating how a new variable, Vegetation Optical Depth (VOD), compares to variables that have traditionally been used to characterize droughts: Land Surface Temperature (LST), Soil Moisture (SM), and the Normalized Difference Vegetation Index (NDVI). Studies have suggested that VOD better characterizes drought conditions, so the objective is to compare VOD to the other variables using information theory and other similarity measures. In addition, the data representation could change how similar VOD appears to the other variables, e.g. using many spatial dimensions versus many temporal dimensions. So we are going to look at different representations and different combinations of VOD, LST, SM, and NDVI and compare them.


Data

For this first set of results, we are looking at a subset of the CONUS dataset covering California. We have a period of 6 years containing both drought and non-drought occurrences:

  • Drought - 2012, 2014, 2015
  • No Drought - 2010, 2011, 2013

We have a spatial resolution of ... and a temporal resolution of 14 days.


Methods

For this first experiment, we only want to quantify the expected uncertainty of each variable under different temporal representations. For example, we can do self-comparisons such as: what is the expected uncertainty of VOD with 1 temporal dimension versus with 3 temporal dimensions? We will do this for all 4 variables individually, using RBIG to estimate the expected uncertainty (i.e. the entropy):

H(\mathbf{X}) = \mathbb{E}_\mathbf{x} \left[ -\log p(\mathbf{x}) \right]
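
As a point of reference for the entropy values reported below, here is a minimal sketch of the closed-form Gaussian entropy of a (samples x features) matrix. This is only a baseline for sanity-checking orders of magnitude, not the RBIG estimator used for the results; the function gaussian_entropy and the arrays X_1 and X_3 are hypothetical.

import numpy as np

def gaussian_entropy(X: np.ndarray) -> float:
    """Entropy (in nats) of a Gaussian fitted to X with shape (n_samples, n_features)."""
    d = X.shape[1]
    cov = np.cov(X, rowvar=False).reshape(d, d)
    # H = 0.5 * log det(2 * pi * e * Sigma)
    _, logdet = np.linalg.slogdet(2 * np.pi * np.e * cov)
    return 0.5 * logdet

# e.g. self-comparison: entropy of VOD with 1 vs 3 temporal dimensions
# print(gaussian_entropy(X_1), gaussian_entropy(X_3))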

Hypothesis

Adding temporal features should increase the entropy of a variable, and we expect to see differences in the entropy values obtained for the different variables. However, there should be a point where adding more temporal dimensions does not add much more information.

Such a trend would give us some intuition about whether there is a 'sweet spot' for the number of temporal dimensions to use.


Preprocessing

  1. Climatology

I remove the climatology because we want to characterize anomalies outside of the seasonal climate patterns.

  2. Similar Data

I ensure that the lat-lon-time locations are consistent across variables.

Note: We will have fewer samples as we increase the number of temporal features because samples at the time-series boundaries must be dropped (see the sketch below).
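
Both preprocessing steps are sketched below, assuming an xarray DataArray da with a time dimension (the names da, remove_climatology and stack_temporal are illustrative, not the functions used in the actual pipeline). The second function also shows why higher temporal dimensions leave fewer samples.

import numpy as np
import xarray as xr

# 1. Climatology: subtract the mean seasonal cycle (here, monthly means)
def remove_climatology(da: xr.DataArray) -> xr.DataArray:
    climatology = da.groupby("time.month").mean("time")
    return da.groupby("time.month") - climatology

# 2. Temporal features: stack d consecutive time steps as d columns;
#    every extra lag removes one sample at the boundary of the series
def stack_temporal(x: np.ndarray, d: int) -> np.ndarray:
    """x: (n_time,) series -> (n_time - d + 1, d) lag matrix."""
    return np.stack([x[i : len(x) - d + 1 + i] for i in range(d)], axis=1)

x = np.arange(10.0)
print(stack_temporal(x, 1).shape, stack_temporal(x, 3).shape)  # (10, 1) (8, 3)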

Code

import sys
from pyprojroot import here
import pathlib
PATH = pathlib.Path(str(here()))
sys.path.append(str(here()))
# sys.path.insert(0, '/home/emmanuel/projects/2019_rbig_ad/src')
# sys.path.append('/home/emmanuel/code/py_esdc')
# sys.path.append('/home/emmanuel/code/rbig')


# DataCube PreProcessing
from scipy.io import savemat, loadmat
import geopandas as geopd
from rasterio import features

# Main Libraries
import numpy as np
import scipy.io as scio
import xarray as xr
import pandas as pd
import seaborn as sns
from datetime import date
import time

# IT Algorithms
# from rbig import RBIG, RBIGMI

# ML Preprocessing
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split
from scipy import signal

# Plotting
import cartopy
import cartopy.crs as ccrs

# NUMPY SETTINGS
import numpy as onp
onp.set_printoptions(precision=3, suppress=True)

# MATPLOTLIB Settings
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# SEABORN SETTINGS
import seaborn as sns
sns.set_context(context='talk',font_scale=0.7)
# sns.set(rc={'figure.figsize': (12, 9.)})
# sns.set_style("whitegrid")

# PANDAS SETTINGS
import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

# LOGGING SETTINGS
import sys
import logging
logging.basicConfig(
    level=logging.INFO, 
    stream=sys.stdout,
    format='%(asctime)s:%(levelname)s:%(message)s'
)
logger = logging.getLogger()

# Utilities
import warnings
warnings.simplefilter('ignore', category=FutureWarning)

# Notebook Specifics
%load_ext autoreload
%autoreload 2
FIG_PATH = PATH.joinpath('reports/figures/drought/individual/lines/')
DATA_PATH = PATH.joinpath('data/drought/results/')
DATA_GROUP_PATH = PATH.joinpath('data/drought/results/compare')
DATA = DATA_PATH.joinpath('exp_ind_v0.csv')
!ls $DATA_GROUP_PATH
drought_v0_t12_s1_c2_smadi.csv v1_t12_s1_c2_smadi.csv
!ls $DATA_PATH
compare                   exp_group_v0.csv  exp_trial_v1.csv
drought_v0_t12_s1_c2.csv       exp_group_v2.csv  group_trial_v1.csv
drought_v0_t12_s1_c2_smadi_sm.csv  exp_ind_v0.csv
drought_v0t1_s1_c2.csv         exp_ind_v2.csv
datasets = [
    'exp_ind_v0.csv',
    'exp_group_v2.csv'
]

Experiment I - Individual Variables

data = pd.read_csv(DATA_PATH.joinpath('exp_ind_v0.csv'), index_col=[0])
data['drought'] = np.where(data['drought'] == 1, True, False)
data.tail()
drought h samples temporal time variable year
259 True 14.974458 28009.0 10.0 13.748295 LST 2015.0
260 True 15.870754 26338.0 11.0 18.317141 VOD 2015.0
261 True 12.780909 26338.0 11.0 13.455114 NDVI 2015.0
262 True 7.286896 26338.0 11.0 23.639647 SM 2015.0
263 True 16.665310 26338.0 11.0 14.535378 LST 2015.0

Normalize

# normalize
data['h_norm'] = data['h'].div(data.temporal)
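
Dividing the joint entropy by the number of temporal dimensions gives an entropy per dimension, so that representations with different numbers of temporal features remain comparable:

H_{norm}(\mathbf{X}) = \frac{H(\mathbf{X})}{d}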

Entropy

def plot_entropy(data, normalized=False, save=True, drought='both'):
    """Plot entropy vs. number of temporal dimensions for each variable."""
    fig, ax = plt.subplots(figsize=(7, 5))

    # select drought years, non-drought years, or both
    if drought == 'on':
        drought = 'drought'
        data = data[data['year'].isin([2012, 2014, 2015])]
        style = None
    elif drought == 'off':
        drought = 'nondrought'
        data = data[data['year'].isin([2010, 2011, 2013])]
        style = None
    elif drought == 'both':
        style = 'drought'
    else:
        raise ValueError('Unrecognized drought state: ', drought)

    # plot raw entropy or entropy per temporal dimension
    y = 'h_norm' if normalized else 'h'

    sns.lineplot(
        x="temporal", y=y,
        hue='variable',
        data=data,
        style=style,
        marker='o',
        ax=ax,
    )
    ax.set_xlabel('Temporal Dims')
    ax.set_ylabel('Entropy')
    plt.tight_layout()
    plt.show()
    if save:
        fig.savefig(FIG_PATH.joinpath(f"H_individual_{drought}.png"))

Drought Years vs Non-Drought Years

plot_entropy(data, normalized=True, save=False, drought='on')
plot_entropy(data, normalized=True, save=False, drought='off')
plot_entropy(data, normalized=True, save=False, drought='both')

Experiment II - Grouped Variables

data = pd.read_csv(DATA_GROUP_PATH.joinpath('v1_t12_s1_c2_smadi.csv'), index_col=[0])
data['drought'] = np.where(data['drought'] == 1, True, False)

data.variable1.unique().tolist()
['SMADI']

# subset dataframe
cols = ['drought', 'temporal', 'rbig_H_x', 'rbig_H_y', 'variable1', 'variable2', 'year']
data = data[cols]

# data.plot(x='temporal', y='rbig_H_x')
# data.plot(x='temporal', y='rbig_H_y')
# data['rbig_H_x'] = data['rbig_H_x'] / data['temporal']
# data['rbig_H_y'] = data['rbig_H_y'] / data['temporal']
# sns.lineplot(data=data, x='temporal', y='rbig_H_x', style='drought', label='SMADI')
# sns.lineplot(data=data, x='temporal', y='rbig_H_y', style='drought', label='SMADI+')
# plt.legend(['SMADI', 'SMADI+'])
# plt.ylabel('Entropy')
# plt.xlabel('Temporal Dims')
# plt.show()

# long format for the variable names (variable1/variable2 -> 'variable')
data1 = pd.melt(
    data,
    id_vars=['drought', 'temporal', 'year'],
    value_vars=['variable1', 'variable2'],
    value_name='variable',
    var_name='no'
).set_index(['drought', 'temporal', 'year'])

# long format for the entropies (rbig_H_x/rbig_H_y -> 'h')
data2 = pd.melt(
    data,
    id_vars=['drought', 'temporal', 'year'],
    value_vars=['rbig_H_x', 'rbig_H_y'],
    var_name='rbig',
    value_name='h'
).set_index(['drought', 'temporal', 'year'])

# align the two long tables so each row pairs a variable name with its entropy
data = pd.concat([data1, data2], axis=1).drop(columns=['rbig', 'no']).reset_index()
data.tail()
drought temporal year variable h
139 True 8.0 2015.0 SMADI+ 33.845365
140 True 9.0 2015.0 SMADI+ 35.426773
141 True 10.0 2015.0 SMADI+ 36.998545
142 True 11.0 2015.0 SMADI+ 38.606384
143 True 12.0 2015.0 SMADI+ 41.001197
sns.lineplot(data=data.reset_index(), x='temporal', y='h', hue='variable', style='drought')
# normalize
data['h_norm'] = data['h'].div(data.temporal)
def plot_entropy_group(data, normalized=False, save=True, drought='both'):
    """Plot entropy vs. number of temporal dimensions for the grouped variables."""
    fig, ax = plt.subplots(figsize=(7, 5))

    # select drought years, non-drought years, or both
    if drought == 'on':
        drought = 'drought'
        data = data[data['year'].isin([2012, 2014, 2015])]
        style = None
    elif drought == 'off':
        drought = 'nondrought'
        data = data[data['year'].isin([2010, 2011, 2013])]
        style = None
    elif drought == 'both':
        style = 'drought'
    else:
        raise ValueError('Unrecognized drought state: ', drought)

    # plot raw entropy or entropy per temporal dimension
    y = 'h_norm' if normalized else 'h'

    sns.lineplot(
        x="temporal", y=y,
        hue='variable',
        data=data,
        style=style,
        marker='o',
        ax=ax,
    )
    ax.set_xlabel('Temporal Dims')
    ax.set_ylabel('Entropy')
    plt.tight_layout()
    plt.show()
    if save:
        fig.savefig(FIG_PATH.joinpath(f"H_grouped_{drought}.png"))

plot_entropy_group(data, normalized=True, save=False, drought='on')
plot_entropy_group(data, normalized=True, save=False, drought='off')
plot_entropy_group(data, normalized=True, save=False, drought='both')

Experiment III - Comparing Variables

This experiment uses the drought_v0t1_s1_c2.csv results file, which contains the pairwise comparisons between the individual variables.
data_group = pd.read_csv(DATA_PATH.joinpath('drought_v0t1_s1_c2.csv'), index_col=[0])
data_group['drought'] = np.where(data_group['drought']==1, True, False)
data_group.head()
cka_coeff cka_x_norm cka_xy_norm cka_y_norm drought kendall pearson rbig_H_time rbig_H_x rbig_H_y rbig_I_time rbig_I_xx rbig_I_xy rbig_Ixx_time rv_coef samples spearman temporal variable1 variable2 x_norm xy_norm y_norm year
0 0.020049 2857.967075 172764.522738 3015.160538 False 0.121740 0.211504 0.740344 1.646508 1.891781 3.142588 54.408459 0.087162 4.658380 0.042068 25779.0 0.181002 1.0 LST SM 10084.888343 4.467320e+06 10529.804294 2010.0
1 0.079700 2857.967075 702843.612344 3085.617960 False -0.149840 -0.232198 0.740419 1.646508 1.968815 3.380640 54.408459 0.266987 4.658880 0.048041 25779.0 -0.213839 1.0 LST NDVI 10084.888343 4.785355e+06 9877.059353 2010.0
2 0.003792 3091.911787 35349.920936 3015.160538 False 0.056306 0.064995 0.739561 2.027987 1.891781 3.126399 55.048992 0.022043 4.142112 0.005580 25779.0 0.083071 1.0 VOD SM 10083.376387 5.924809e+05 10529.804294 2010.0
3 0.009281 3085.617960 86347.669846 3015.160538 False -0.081360 -0.107464 0.739851 1.968815 1.891781 3.123344 54.198307 0.027839 3.681402 0.008848 25779.0 -0.118719 1.0 NDVI SM 9877.059353 9.202224e+05 10529.804294 2010.0
4 0.008383 3085.617960 79979.840539 3091.911787 False 0.016970 0.008711 0.740168 1.968815 2.027987 3.181453 54.198307 0.033399 3.684294 0.000063 25779.0 0.024544 1.0 NDVI VOD 9877.059353 6.274789e+03 10083.376387 2010.0

Normalize

# normalize
data_group['mi_norm'] = data_group['rbig_I_xy'].div(data_group.temporal)
def move_variables(df: pd.DataFrame, variable: str) -> pd.DataFrame:
    """Swap variable1/variable2 so that `variable` always appears in the variable1 column."""
    cond = df['variable2'] == variable
    df.loc[
        cond, ['variable2', 'variable1']
    ] = df.loc[
        cond, ['variable1', 'variable2']
    ].values

    return df
df_new = move_variables(data_group, 'NDVI')

df_new.head()
cka_coeff cka_x_norm cka_xy_norm cka_y_norm drought kendall pearson rbig_H_time rbig_H_x rbig_H_y rbig_I_time rbig_I_xx rbig_I_xy rbig_Ixx_time rv_coef samples spearman temporal variable1 variable2 x_norm xy_norm y_norm year mi_norm
0 0.020049 2857.967075 172764.522738 3015.160538 False 0.121740 0.211504 0.740344 1.646508 1.891781 3.142588 54.408459 0.087162 4.658380 0.042068 25779.0 0.181002 1.0 LST SM 10084.888343 4.467320e+06 10529.804294 2010.0 0.087162
1 0.079700 2857.967075 702843.612344 3085.617960 False -0.149840 -0.232198 0.740419 1.646508 1.968815 3.380640 54.408459 0.266987 4.658880 0.048041 25779.0 -0.213839 1.0 NDVI LST 10084.888343 4.785355e+06 9877.059353 2010.0 0.266987
2 0.003792 3091.911787 35349.920936 3015.160538 False 0.056306 0.064995 0.739561 2.027987 1.891781 3.126399 55.048992 0.022043 4.142112 0.005580 25779.0 0.083071 1.0 VOD SM 10083.376387 5.924809e+05 10529.804294 2010.0 0.022043
3 0.009281 3085.617960 86347.669846 3015.160538 False -0.081360 -0.107464 0.739851 1.968815 1.891781 3.123344 54.198307 0.027839 3.681402 0.008848 25779.0 -0.118719 1.0 NDVI SM 9877.059353 9.202224e+05 10529.804294 2010.0 0.027839
4 0.008383 3085.617960 79979.840539 3091.911787 False 0.016970 0.008711 0.740168 1.968815 2.027987 3.181453 54.198307 0.033399 3.684294 0.000063 25779.0 0.024544 1.0 NDVI VOD 9877.059353 6.274789e+03 10083.376387 2010.0 0.033399

Mutual Information

from typing import Optional

def plot_heatmap(df: pd.DataFrame, year: Optional[float]=None, measure: str='mi_norm') -> None:

    # subset df with the chosen similarity measure
    sub_df = df[['year', 'variable1', 'variable2', measure]]

    # single year if given, otherwise average the measure across years
    if year is not None:
        sub_df = sub_df[sub_df['year'].isin([year])].drop('year', axis=1)
    else:
        sub_df = sub_df.groupby(['variable1', 'variable2'], as_index=False)[measure].mean()

    # mirror the variable pairs so the pivoted matrix is symmetric
    # (.values avoids column alignment during the swap)
    ssub_df = sub_df.copy()
    ssub_df[['variable1', 'variable2']] = ssub_df[['variable2', 'variable1']].values

    sub_df = pd.concat([sub_df, ssub_df]).pivot(index='variable1', columns='variable2', values=measure)

    fig, ax = plt.subplots()
    sns.heatmap(sub_df, ax=ax, annot=True, cmap='Reds', vmin=0)
    ax.set(xlabel='', ylabel='')
    plt.tight_layout()
    plt.show()

    return None
for iyear in [2010.0, 2011.0, 2012.0, 2013.0, 2014.0, 2015.0]:
    plot_heatmap(df_new, iyear, 'cka_coeff')

Experiment IV - Comparing Grouped Variables

data = pd.read_csv(DATA_GROUP_PATH.joinpath('v1_t12_s1_c2_smadi.csv'), index_col=[0])
data['drought'] = np.where(data['drought'] == 1, True, False)

# subset dataframe
cols = ['drought', 'temporal', 'rbig_I_xy', 'rv_coef', 'variable1', 'variable2', 'year']
data = data[cols]


# data.plot(x='temporal', y='rbig_H_x')
# data.plot(x='temporal', y='rbig_H_y')
# data['rbig_H_x'] = data['rbig_H_x'] / data['temporal']
# data['rbig_H_y'] = data['rbig_H_y'] / data['temporal']
# sns.lineplot(data=data, x='temporal', y='rbig_H_x', style='drought', label='SMADI')
# sns.lineplot(data=data, x='temporal', y='rbig_H_y', style='drought', label='SMADI+')
# plt.legend(['SMADI', 'SMADI+'])
# plt.ylabel('Entropy')
# plt.xlabel('Temporal Dims')
# plt.show()

# data = pd.melt(
#     data, 
#     id_vars=['drought', 'temporal', 'year', 'rbig_I_xy'],
# #     value_name='variable', 
#     value_vars=['variable1', 'variable2'],
#     value_name='variable',
#     var_name='no'

# ).drop(columns=['no']).reset_index()
# data1.plot(x='temporal', y='h')
# data2.plot(x='temporal', y='h')
# data = pd.concat([data1, data2], axis=1)
data.tail()
drought temporal rbig_I_xy rv_coef variable1 variable2 year
67 True 8.0 39.635912 0.901904 SMADI SMADI+ 2015.0
68 True 9.0 39.784822 0.902987 SMADI SMADI+ 2015.0
69 True 10.0 41.570516 0.904695 SMADI SMADI+ 2015.0
70 True 11.0 53.638460 0.904877 SMADI SMADI+ 2015.0
71 True 12.0 62.999866 0.907876 SMADI SMADI+ 2015.0
sns.lineplot(data=data.reset_index(), x='temporal', y='rbig_I_xy', style='drought')
data['I_xy_norm'] = data['rbig_I_xy'] / data['temporal']

sns.lineplot(data=data.reset_index(), x='temporal', y='I_xy_norm', style='drought')
data['I_xy_lnorm'] = np.log(data['rbig_I_xy'] / data['temporal'])

sns.lineplot(data=data.reset_index(), x='temporal', y='I_xy_lnorm', style='drought')
data['I_xy_lnorm'] = data['rbig_I_xy'] / np.log(data['temporal'])

sns.lineplot(data=data.reset_index(), x='temporal', y='I_xy_lnorm', style='drought')
np.log(np.arange(1,12+1))
array([0.   , 0.693, 1.099, 1.386, 1.609, 1.792, 1.946, 2.079, 2.197,
       2.303, 2.398, 2.485])
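
To summarize, three normalizations of the estimated mutual information are compared above: rbig_I_xy / d (MI per temporal dimension), the logarithm of that quantity, and rbig_I_xy / log(d). The array above lists log(d) for d = 1..12; note that the last normalization is undefined at d = 1, where log(d) = 0.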

RV Coefficient

sns.lineplot(data=data.reset_index(), x='temporal', y='rv_coef', style='drought')