
Droughts - Pre-Processing

In this notebook, I go over the preprocessing steps needed before starting the experiments. It covers the following steps:

  1. Load Data
  2. Select California
  3. Interpolate NaNs
  4. Smooth the VOD signal (Savitzky-Golay filter)
  5. Remove the climatology
  6. Select drought years and non-drought years
  7. Extract density cubes


import sys, os
cwd = os.getcwd()
sys.path.insert(0, f'{cwd}/../../')
sys.path.insert(0, '/home/emmanuel/code/py_esdc')


import xarray as xr
import pandas as pd
import numpy as np

# drought tools
from src.data.drought.loader import DataLoader
from src.features.drought.build_features import (
    get_cali_geometry,
    mask_datacube,
    smooth_vod_signal,
    remove_climatology,
    get_cali_emdata,
    get_drought_years,
    get_density_cubes,
    get_common_elements,
    normalize
)
from src.visualization.drought.analysis import plot_mean_time

# esdc tools
from esdc.subset import select_pixel
from esdc.shape import ShapeFileExtract, rasterize
from esdc.transform import DensityCubes

import matplotlib.pyplot as plt
import cartopy
import cartopy.crs as ccrs
plt.style.use(['fivethirtyeight', 'seaborn-poster'])
%matplotlib inline

%load_ext autoreload
%autoreload 2

1. Load Data

region = 'conus'
sampling = '14D'

drought_cube = DataLoader().load_data(region, sampling)


pixel = (-121, 37)

drought_cube
<xarray.Dataset>
Dimensions:  (lat: 461, lon: 865, time: 146)
Coordinates:
  * lat      (lat) float64 25.88 25.93 25.98 26.03 ... 48.74 48.79 48.84 48.89
  * lon      (lon) float64 -124.4 -124.3 -124.2 -124.2 ... -80.64 -80.59 -80.54
  * time     (time) datetime64[ns] 2010-06-01 2010-06-15 ... 2015-12-22
Data variables:
    SMADI    (lat, lon, time) float64 ...
    LST      (lat, lon, time) float64 ...
    NDVI     (lat, lon, time) float64 ...
    VOD      (lat, lon, time) float64 ...
    SM       (lat, lon, time) float64 ...

Verify with a simple plot.

plot_mean_time(
    drought_cube.LST.sel(time=slice('June-2010', 'June-2010'))
)
/home/emmanuel/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/xarray/core/nanops.py:140: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis=axis, dtype=dtype)

2. Subset California

# get california polygon
cali_geoms = get_cali_geometry()

# get california cube subset
cali_cube = mask_datacube(drought_cube, cali_geoms)
plot_mean_time(
    cali_cube.LST.sel(time=slice('June-2011', 'June-2011'))
)

3. Interpolate NaNs - Time Dimension

# interpolation arguments
interp_dim = 'time'
method = 'linear'

# do interpolation
cali_cube_interp = cali_cube.interpolate_na(
    dim=interp_dim, 
    method=method
)
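As a quick sanity check, `interpolate_na` can be exercised on a toy `DataArray` (illustrative only, not part of the pipeline):

```python
import numpy as np
import xarray as xr

# toy series with two gaps
da = xr.DataArray(
    [1.0, np.nan, 3.0, np.nan, 5.0],
    dims='time',
    coords={'time': np.arange(5)},
)

# linear interpolation along time, mirroring the call above
filled = da.interpolate_na(dim='time', method='linear')
print(filled.values)  # [1. 2. 3. 4. 5.]
```

With an equally spaced coordinate, each gap is filled with the midpoint of its neighbours.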

4. Smoothing the Signal (VOD)

In this section, we will smooth the signal with two methods:

  1. A simple rolling mean
  2. A Savitzky-Golay (savgol) filter

Some initial parameters:

  • Window Size = 5
  • Polynomial Order = 3

We will apply this filter in the time domain only.
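To see how the two smoothers behave on data with a known ground truth, here is a small synthetic comparison (illustrative only; the savgol parameters mirror the ones above):

```python
import numpy as np
from scipy.signal import savgol_filter

# noisy sine as a stand-in for the VOD series
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 100)
noisy = np.sin(t) + 0.1 * rng.standard_normal(t.size)

# Savitzky-Golay: fit a local cubic in each 5-sample window
smooth_sg = savgol_filter(noisy, window_length=5, polyorder=3)

# rolling mean: flat 5-sample kernel
smooth_rm = np.convolve(noisy, np.ones(5) / 5, mode='same')

# compare mean absolute errors against the clean sine
print(np.abs(noisy - np.sin(t)).mean(),
      np.abs(smooth_sg - np.sin(t)).mean(),
      np.abs(smooth_rm - np.sin(t)).mean())
```

The savgol filter reduces the noise while preserving the shape of the slowly varying signal, which is why it is a common choice for VOD-type time series.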

vod_data = cali_cube_interp.VOD
vod_data
<xarray.DataArray 'VOD' (lat: 189, lon: 103, time: 146)>
array([[[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]]])
Coordinates:
  * lat      (lat) float64 32.53 32.58 32.63 32.68 ... 41.79 41.84 41.89 41.94
  * lon      (lon) float64 -124.4 -124.3 -124.2 -124.2 ... -118.7 -118.7 -118.6
  * time     (time) datetime64[ns] 2010-06-01 2010-06-15 ... 2015-12-22

4.1 - Savgol Filter

from scipy.signal import savgol_filter
# select example
vod_data_ex = select_pixel(vod_data, pixel)

# savgol filter params
window_length = 5
polyorder = 3

# apply savgol filter
vod_smooth_filter = savgol_filter(
    vod_data_ex, 
    window_length=window_length, 
    polyorder=polyorder
)

fig, ax = plt.subplots(nrows=2, figsize=(10, 10))

ax[0].plot(vod_data_ex)
ax[0].set_title('Original Data')

ax[1].plot(vod_smooth_filter)
ax[1].set_title('After Savgol Filter')
plt.show()

4.2 - Rolling Window

# select example
vod_data_ex = select_pixel(vod_data, pixel)

# rolling window params
window_length = 2

# apply rolling mean
vod_smooth_roll = vod_data_ex.rolling(
    time=window_length, 
    center=True
).mean()

fig, ax = plt.subplots(nrows=2, figsize=(10, 10))

ax[0].plot(vod_data_ex)
ax[0].set_title('Original Data')

ax[1].plot(vod_smooth_roll)
ax[1].set_title('After Rolling Mean')
plt.show()

4.3 - Difference

vod_smooth_diff = vod_smooth_filter - vod_smooth_roll
fig, ax = plt.subplots(nrows=4,figsize=(10,10))



ax[0].plot(vod_data_ex)
ax[0].set_title('Original')

ax[1].plot(vod_smooth_filter)
ax[1].set_title('Savgol Filter')

ax[2].plot(vod_smooth_roll)
ax[2].set_title('Rolling Mean')

ax[3].plot(vod_smooth_diff)
ax[3].set_title('Difference')

# Scale the difference y-limits symmetrically about zero,
# spanning the same range as the smoothed signals
ymax = np.max([vod_smooth_filter.max(), vod_smooth_roll.max()])
ymin = np.min([vod_smooth_filter.min(), vod_smooth_roll.min()])
half_range = (ymax - ymin) / 2

ax[3].set_ylim([-half_range, half_range])

plt.tight_layout()
plt.show()

4.4 - Apply Rolling Mean to the Whole Dataset

cali_cube_interp = smooth_vod_signal(cali_cube_interp, window_length=2, center=True)
/home/emmanuel/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/xarray/core/nanops.py:140: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis=axis, dtype=dtype)

5. Remove Climatology

By 'climatology' I mean the typical seasonal cycle, so the anomalies are the difference between the observations and the typical values for that time of year; after removal, the seasonal cycle should no longer show up. I'll do a very simple removal: calculate the monthly mean with respect to time and then subtract it from the corresponding months of the original datacube.

Steps

  1. Climatology - monthly mean over the six years
  2. Anomalies - subtract the climatology from each month

# remove climatology
cali_anomalies, cali_mean = remove_climatology(cali_cube_interp)
/home/emmanuel/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/xarray/core/nanops.py:140: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis=axis, dtype=dtype)
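The logic of `remove_climatology` can be sketched with xarray's `groupby` on a toy series (a sketch under the assumption that the helper subtracts a monthly-mean climatology; the actual implementation may differ):

```python
import numpy as np
import pandas as pd
import xarray as xr

# toy monthly series spanning two years
time = pd.date_range('2010-01-01', periods=24, freq='MS')
da = xr.DataArray(np.arange(24, dtype=float), dims='time', coords={'time': time})

# climatology: mean over all years for each calendar month
climatology = da.groupby('time.month').mean('time')

# anomalies: subtract each calendar month's climatology
anomalies = da.groupby('time.month') - climatology
print(anomalies.values)
```

On this linear ramp, every month of year one sits 6 below its climatology and every month of year two sits 6 above it, so the anomalies come out as ±6.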

Simple check where we look at the original and the new.

variables = ['LST', 'VOD', 'NDVI', 'SM']

for ivariable in variables:

    fig, ax = plt.subplots(nrows=3, figsize=(10, 10))

    # Before Climatology
    select_pixel(cali_cube_interp[ivariable], pixel).plot(ax=ax[0])
    ax[0].set_title('Original Time Series')

    # Climatology
    select_pixel(cali_mean[ivariable], pixel).plot(ax=ax[1])
    ax[1].set_title('Climatology')

    # After Climatology
    select_pixel(cali_anomalies[ivariable], pixel).plot(ax=ax[2])
    ax[2].set_title('After Climatology Removed')
    plt.tight_layout()
    plt.show()

6. EM-DAT Data

I extract the dates of the drought events for California from the EM-DAT records. This will allow me to separate the drought years from the non-drought years.

cali_droughts = get_cali_emdata()
cali_droughts
ID ISO LOCATION Date_Start Date_End UTC_Start UTC_End geometry
105 106 USA California 01-Jun-2012 31-Dec-2012 735021 735234 (POLYGON ((-117.2328491210938 32.7764053344726...
132 133 USA California 01-Jan-2014 31-Dec-2014 735600 735964 (POLYGON ((-117.2328491210938 32.7764053344726...
153 154 USA California 01-Jan-2015 31-Dec-2015 735965 736329 (POLYGON ((-117.2328491210938 32.7764053344726...

So the drought years are:

Drought Years

  • 2012
  • 2014
  • 2015

Non-Drought Years

  • 2010
  • 2011
  • 2013

Note: Even though EM-DAT lists the 2012 drought as covering only half the year (June-December), we're going to treat 2012 as a full drought year.

# Drought Years
cali_anomalies_drought = get_drought_years(
    cali_anomalies, ['2012', '2014', '2015']
)

# Non-Drought Years
cali_anomalies_nondrought = get_drought_years(
    cali_anomalies, ['2010', '2011', '2013']
)
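For reference, the year split can be reproduced directly with xarray's datetime accessors (a sketch of what `get_drought_years` presumably does; the helper may differ):

```python
import numpy as np
import pandas as pd
import xarray as xr

# toy time axis matching the cube: 146 steps of 14 days from June 2010
time = pd.date_range('2010-06-01', periods=146, freq='14D')
da = xr.DataArray(np.zeros(146), dims='time', coords={'time': time})

# keep only time stamps whose year is a drought year
mask = da['time.year'].isin([2012, 2014, 2015])
drought = da.where(mask, drop=True)
print(sorted(set(drought['time.year'].values)))  # [2012, 2014, 2015]
```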

7. Extract Density Cubes

In this step, we will construct 'density cubes'. These are cubes where we add features from a combination of the spatial and/or temporal dimensions, so that instead of a single value, each sample carries spatial and/or temporal context. In this experiment, we will only look at temporal information. Our temporal resolution is 14 days and we want to look at a maximum of 6 months.

So:

\Bigg\lfloor \frac{6 \: months}{\frac{14\: days}{30 \: days} \:\times 1 \: month} \Bigg\rfloor = 12 \: time \: stamps
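The arithmetic can be checked directly:

```python
import math

days_per_month = 30   # approximate month length
step_days = 14        # temporal resolution of the cube
months = 6

# number of 14-day steps that fit in 6 months, rounded down
n_steps = math.floor(months / (step_days / days_per_month))
print(n_steps)  # 12
```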

# confirm
sub_ = cali_anomalies_drought.isel(time=slice(0,12))

sub_.time[0].data, sub_.time[-1].data
(array('2012-01-10T00:00:00.000000000', dtype='datetime64[ns]'),
 array('2012-06-12T00:00:00.000000000', dtype='datetime64[ns]'))

We see that for 2012 the first time stamp is 10 January and the last is 12 June, which is good enough: each density cube carries roughly 6 months of temporal information.
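For a single pixel, the temporal density cube is essentially a lag matrix. It can be sketched with NumPy's `sliding_window_view` (illustrative only; `get_density_cubes` may be implemented differently):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# toy single-pixel time series with 20 time stamps
series = np.arange(20, dtype=float)

# each row is one sample carrying 12 consecutive time stamps
density_cube = sliding_window_view(series, window_shape=12)
print(density_cube.shape)  # (9, 12)
```

A series of length T yields T - 12 + 1 samples, each with 12 temporal features, which matches the (n_samples, 12) shapes seen below.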

7.1 - Example Density Cube

# window sizes
spatial = 1
time = 12

vod_df, lst_df, ndvi_df, sm_df = get_density_cubes(cali_anomalies_drought, spatial, time)
vod_df.shape, lst_df.shape, ndvi_df.shape, sm_df.shape
((76690, 12), (85995, 12), (82729, 12), (76690, 12))

8. Find Common Elements

Notice how the number of samples differs between the datasets: the VOD and SM density cubes have fewer elements than LST and NDVI. To make a fair comparison, I'll use only the samples common to the density cubes being compared.

Note: It is also somewhat difficult for RBIG to calculate the mutual information for datasets that are potentially so different in their domain.
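The idea behind `get_common_elements` can be sketched with a pandas index intersection (an assumption about the helper's behavior; illustrative only):

```python
import numpy as np
import pandas as pd

# toy density cubes with partially overlapping sample indices
vod = pd.DataFrame(np.ones((4, 3)), index=[0, 1, 2, 3])
lst = pd.DataFrame(np.ones((5, 3)), index=[1, 2, 3, 4, 5])

# keep only the samples present in both frames
common = vod.index.intersection(lst.index)
vod_c, lst_c = vod.loc[common], lst.loc[common]
print(vod_c.shape, lst_c.shape)  # (3, 3) (3, 3)
```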

vod_df, lst_df = get_common_elements(vod_df, lst_df)
vod_df.shape, lst_df.shape
((76690, 12), (76690, 12))