3.1 Results - Classifying Similarity¶
Recap¶
Recall that we are trying to investigate how a new variable called Vegetation Optical Depth (VOD) compares to some of the previously used variables Land Surface Temperature (LST), Soil Moisture (SM) and Normalized Vegetation Difference Index (NDVI). The previous variables were often used to recover droughts, but studies have shown that VOD will better characterize drought conditions. So, the objective is to see how these VOD compares to other variables via information theory and other similarity measures. In addition, the data representation could make a difference in how similar VOD is to the other variables. For example, many spatial dimension or many temporal dimensions. So, we are going to look at different representations and different combinations of VOD, LST, SM, and NDVI and do comparisons.
Data¶
For this first set of results, we are looking at the CONUS dataset which is located in California. We have a period of 6 years where there were drought occurences and not drought occurences.
- Droughts - 2012, 2014, 2015)
- No Drought - 2010, 2011, 2013
We have a spatial resolution of ... and a temporal resolution of 14 days.
Methods¶
For this first experiment, we only want to classify the expected uncertainty that each variable has with different temporal representations. So for example, we can do self-comparisons like what's the expected uncertainty of VOD with 1 temporal dimension versus 3 temporal dimensions? We will do this for all 4 variables individually. We will use RBIG to measure the expected uncertainty (aka Entropy)
Hypothesis¶
Adding temporal features will definitely increase the amount of entropy within a variable. We also expect to see some differences in the entropy value obtained from the different variables. However, there should be a point where adding more temporal dimensions may not add much more information.
We see some trend that perhaps gives us intuition that there could be a 'sweet' spot for the amount of temporal dimensions to use.
Preprocessing¶
- Climatology
I remove the climatology because we want to characterize the anomalies outside of the climate patterns.
- Similar Data
I ensure that the lat-lon-time locations are consistent across variables.
Note: We will have less samples as we increase the number of features because of the boundaries.
Code¶
import sys
from pyprojroot import here
import pathlib
PATH = pathlib.Path(str(here()))
sys.path.append(str(here()))
# sys.path.insert(0, '/home/emmanuel/projects/2019_rbig_ad/src')
# sys.path.append('/home/emmanuel/code/py_esdc')
# sys.path.append('/home/emmanuel/code/rbig')
# DataCube PreProcessing
from scipy.io import savemat, loadmat
import geopandas as geopd
from rasterio import features
# Main Libraries
import numpy as np
import scipy.io as scio
import xarray as xr
import pandas as pd
import seaborn as sns
from datetime import date
import time
# IT Algorithms
# from rbig import RBIG, RBIGMI
# ML Preprocessing
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split
from scipy import signal
# Plotting
import cartopy
import cartopy.crs as ccrs
# NUMPY SETTINGS
import numpy as onp
onp.set_printoptions(precision=3, suppress=True)
# MATPLOTLIB Settings
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# SEABORN SETTINGS
import seaborn as sns
sns.set_context(context='talk',font_scale=0.7)
# sns.set(rc={'figure.figsize': (12, 9.)})
# sns.set_style("whitegrid")
# PANDAS SETTINGS
import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)
# LOGGING SETTINGS
import sys
import logging
logging.basicConfig(
level=logging.INFO,
stream=sys.stdout,
format='%(asctime)s:%(levelname)s:%(message)s'
)
logger = logging.getLogger()
# Utilities
import warnings
warnings.simplefilter('ignore', category=FutureWarning)
# Notebook Specifics
%load_ext autoreload
%autoreload 2
FIG_PATH = PATH.joinpath('reports/figures/drought/compare/lines')
DATA_PATH = PATH.joinpath('data/drought/results/')
DATA = DATA_PATH.joinpath('exp_ind_v0.csv')
data = pd.read_csv(DATA_PATH.joinpath('drought_v0_t12_s1_c2.csv'), index_col=[0])
data['drought'] = np.where(data['drought'] == 1, True, False)
data.tail()
# normalize
data['I'] = data['rbig_I_xy']
data['I_norm'] = data['rbig_I_xy'].div(data.temporal)
# normalize
data['h_norm'] = data['h'].div(data.temporal)
data['tc_norm'] = data['tc'].div(data.temporal)
Mutual Information¶
def plot_mi(data, measure: str, variable: str, save=True, drought=True):
data = data.copy()
fig, ax = plt.subplots(figsize=(7, 5))
if drought == 'on':
drought = 'drought'
data = data[data['year'].isin([2012, 2014, 2015])]
style = None
elif drought == 'off':
drought = 'nondrought'
style = None
data = data[data['year'].isin([2010, 2011, 2013])]
elif drought == 'both':
drought = 'both'
style = 'drought'
else:
raise ValueError('Unrecognized drought state: ', drought)
if measure in ['I', 'I_norm']:
y_label = 'Mutual Information'
elif measure in ['rv_coef']:
y_label = 'Pearson (Multivariate)'
elif measure in ['cka_coeff']:
y_label = 'CKA'
else:
raise ValueError('Unrecognized measure...')
# Select variable
data = move_variables(data, variable)
data = data[data['variable1'] == variable]
sns.lineplot(
x="temporal", y=measure,
hue='variable2',
data=data,
style=style,
marker='o',
)
ax.set_xlabel('Temporal Dims')
ax.set_ylabel(y_label)
# plt.legend(['NDVI', 'LST', 'SM', 'VOD'])
plt.tight_layout()
# if normalized and save:
# fig.savefig(f"{FIG_PATH}I_norm_individual_{drought}.png", frameon=False, )
# elif save:
fig.savefig(FIG_PATH.joinpath(f"{variable}_{measure}_{drought}.png"))
def move_variables(df: pd.DataFrame, variable: str)-> pd.DataFrame:
# cond1 = df['variable1'] == variable
cond = df['variable2'] == variable
df.loc[
cond, ['variable2', 'variable1']
] = df.loc[
cond, ['variable1', 'variable2']
].values
return df
Mutual Information¶
plot_mi(data, measure="I_norm", variable='VOD', drought='both')
plot_mi(data, measure="I_norm", variable='SM', drought='both')
plot_mi(data, measure="I_norm", variable='LST', drought='both')
plot_mi(data, measure="I_norm", variable='NDVI', drought='both')
RV Coefficient (Multivariate Pearson)¶
plot_mi(data, measure="rv_coef", variable='VOD', drought='both')
plot_mi(data, measure="rv_coef", variable='SM', drought='both')
plot_mi(data, measure="rv_coef", variable='LST', drought='both')
plot_mi(data, measure="rv_coef", variable='NDVI', drought='both')
Centered Kernel Alignment (Kernel Method)¶
plot_mi(data, measure="cka_coeff", variable='VOD', drought='off')
plot_mi(data, measure="cka_coeff", variable='SM', drought='off')
plot_mi(data, measure="cka_coeff", variable='LST', drought='off')
plot_mi(data, measure="cka_coeff", variable='NDVI', drought='off')