
Droughts - Pre-Processing

In this notebook, I go over the preprocessing steps needed before starting the experiments. It covers the following steps:

  1. Load Data
  2. Select California
  3. Interpolate NaNs
  4. Smooth the VOD signal (Savitzky-Golay filter)
  5. Remove the climatology
  6. Select drought years and non-drought years
  7. Extract density cubes


import sys, os
cwd = os.getcwd()
sys.path.insert(0, f'{cwd}/../../')
sys.path.insert(0, '/home/emmanuel/code/py_esdc')


import xarray as xr
import pandas as pd
import numpy as np

# drought tools
from src.data.drought.loader import DataLoader
from src.features.drought.build_features import (
    get_cali_geometry,
    mask_datacube,
    smooth_vod_signal,
    remove_climatology,
    get_cali_emdata,
    get_drought_years,
    get_density_cubes,
    get_common_elements,
    normalize
)
from src.visualization.drought.analysis import plot_mean_time

# esdc tools
from esdc.subset import select_pixel
from esdc.shape import ShapeFileExtract, rasterize
from esdc.transform import DensityCubes

import matplotlib.pyplot as plt
import cartopy
import cartopy.crs as ccrs
plt.style.use(['fivethirtyeight', 'seaborn-poster'])
%matplotlib inline

%load_ext autoreload
%autoreload 2

1. Load Data

region = 'conus'
sampling = '14D'

drought_cube = DataLoader().load_data(region, sampling)


pixel = (-121, 37)

drought_cube
<xarray.Dataset>
Dimensions:  (lat: 461, lon: 865, time: 146)
Coordinates:
  * lat      (lat) float64 25.88 25.93 25.98 26.03 ... 48.74 48.79 48.84 48.89
  * lon      (lon) float64 -124.4 -124.3 -124.2 -124.2 ... -80.64 -80.59 -80.54
  * time     (time) datetime64[ns] 2010-06-01 2010-06-15 ... 2015-12-22
Data variables:
    SMADI    (lat, lon, time) float64 ...
    LST      (lat, lon, time) float64 ...
    NDVI     (lat, lon, time) float64 ...
    VOD      (lat, lon, time) float64 ...
    SM       (lat, lon, time) float64 ...

Verify with a simple plot.

plot_mean_time(
    drought_cube.LST.sel(time=slice('June-2010', 'June-2010'))
)
/home/emmanuel/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/xarray/core/nanops.py:140: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis=axis, dtype=dtype)

2. Subset California

# get california polygon
cali_geoms = get_cali_geometry()

# get california cube subset
cali_cube = mask_datacube(drought_cube, cali_geoms)
plot_mean_time(
    cali_cube.LST.sel(time=slice('June-2011', 'June-2011'))
)

3. Interpolate NaNs - Time Dimension

# interpolation arguments
interp_dim = 'time'
method = 'linear'

# do interpolation
cali_cube_interp = cali_cube.interpolate_na(
    dim=interp_dim, 
    method=method
)
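As a quick sanity check, `interpolate_na` can be exercised on a toy `DataArray` (illustrative only, not part of the pipeline):

```python
import numpy as np
import xarray as xr

# toy series with two gaps
da = xr.DataArray(
    [1.0, np.nan, 3.0, np.nan, 5.0],
    dims='time',
    coords={'time': np.arange(5)},
)

# linear interpolation along time, mirroring the call above
filled = da.interpolate_na(dim='time', method='linear')
print(filled.values)  # [1. 2. 3. 4. 5.]
```

With an equally spaced coordinate, each gap is filled with the midpoint of its neighbours.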

4. Smoothing the Signal (VOD)

In this section, we will smooth the signal with two methods:

  1. A simple rolling mean
  2. A Savitzky-Golay (savgol) filter

Some initial parameters:

  • Window Size = 5
  • Polynomial Order = 3

We will apply this filter in the time domain only.
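To see how the two smoothers behave on data with a known ground truth, here is a small synthetic comparison (illustrative only; the savgol parameters mirror the ones above):

```python
import numpy as np
from scipy.signal import savgol_filter

# noisy sine as a stand-in for the VOD series
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 100)
noisy = np.sin(t) + 0.1 * rng.standard_normal(t.size)

# Savitzky-Golay: fit a local cubic in each 5-sample window
smooth_sg = savgol_filter(noisy, window_length=5, polyorder=3)

# rolling mean: flat 5-sample kernel
smooth_rm = np.convolve(noisy, np.ones(5) / 5, mode='same')

# compare mean absolute errors against the clean sine
print(np.abs(noisy - np.sin(t)).mean(),
      np.abs(smooth_sg - np.sin(t)).mean(),
      np.abs(smooth_rm - np.sin(t)).mean())
```

The savgol filter reduces the noise while preserving the shape of the slowly varying signal, which is why it is a common choice for VOD-type time series.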

vod_data = cali_cube_interp.VOD
vod_data
<xarray.DataArray 'VOD' (lat: 189, lon: 103, time: 146)>
array([[[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]]])
Coordinates:
  * lat      (lat) float64 32.53 32.58 32.63 32.68 ... 41.79 41.84 41.89 41.94
  * lon      (lon) float64 -124.4 -124.3 -124.2 -124.2 ... -118.7 -118.7 -118.6
  * time     (time) datetime64[ns] 2010-06-01 2010-06-15 ... 2015-12-22

4.1 - Savgol Filter

from scipy.signal import savgol_filter
# select example
vod_data_ex = select_pixel(vod_data, pixel)

# savgol filter params
window_length = 5
polyorder = 3

# apply savgol filter
vod_smooth_filter = savgol_filter(
    vod_data_ex, 
    window_length=window_length, 
    polyorder=polyorder
)

fig, ax = plt.subplots(nrows=2, figsize=(10, 10))

ax[0].plot(vod_data_ex)
ax[0].set_title('Original Data')

ax[1].plot(vod_smooth_filter)
ax[1].set_title('After Savgol Filter')
plt.show()

4.2 - Rolling Window

# select example
vod_data_ex = select_pixel(vod_data, pixel)

# rolling window params
window_length = 2

# apply rolling mean
vod_smooth_roll = vod_data_ex.rolling(
    time=window_length, 
    center=True
).mean()

fig, ax = plt.subplots(nrows=2, figsize=(10, 10))

ax[0].plot(vod_data_ex)
ax[0].set_title('Original Data')

ax[1].plot(vod_smooth_roll)
ax[1].set_title('After Rolling Mean')
plt.show()

4.3 - Difference

vod_smooth_diff = vod_smooth_filter - vod_smooth_roll
fig, ax = plt.subplots(nrows=4,figsize=(10,10))



ax[0].plot(vod_data_ex)
ax[0].set_title('Original')

ax[1].plot(vod_smooth_filter)
ax[1].set_title('Savgol Filter')

ax[2].plot(vod_smooth_roll)
ax[2].set_title('Rolling Mean')

ax[3].plot(vod_smooth_diff)
ax[3].set_title('Difference')

# Scale the difference y-limits symmetrically about zero,
# spanning the same range as the smoothed signals
ymax = np.max([vod_smooth_filter.max(), vod_smooth_roll.max()])
ymin = np.min([vod_smooth_filter.min(), vod_smooth_roll.min()])
half_range = (ymax - ymin) / 2

ax[3].set_ylim([-half_range, half_range])

plt.tight_layout()
plt.show()

4.4 - Apply Rolling Mean to the Whole Dataset

cali_cube_interp = smooth_vod_signal(cali_cube_interp, window_length=2, center=True)
/home/emmanuel/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/xarray/core/nanops.py:140: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis=axis, dtype=dtype)

5. Remove Climatology

By 'climatology' I mean the typical seasonal cycle, so the anomalies are the difference between the observations and the typical values for that time of year; after removal, the seasonal cycle should no longer show up. I'll do a very simple removal: calculate the monthly mean with respect to time and then subtract it from the corresponding months of the original datacube.

Steps

  1. Climatology - monthly mean over the six years
  2. Anomalies - subtract the climatology from each month

# remove climatology
cali_anomalies, cali_mean = remove_climatology(cali_cube_interp)
/home/emmanuel/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/xarray/core/nanops.py:140: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis=axis, dtype=dtype)
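The logic of `remove_climatology` can be sketched with xarray's `groupby` on a toy series (a sketch under the assumption that the helper subtracts a monthly-mean climatology; the actual implementation may differ):

```python
import numpy as np
import pandas as pd
import xarray as xr

# toy monthly series spanning two years
time = pd.date_range('2010-01-01', periods=24, freq='MS')
da = xr.DataArray(np.arange(24, dtype=float), dims='time', coords={'time': time})

# climatology: mean over all years for each calendar month
climatology = da.groupby('time.month').mean('time')

# anomalies: subtract each calendar month's climatology
anomalies = da.groupby('time.month') - climatology
print(anomalies.values)
```

On this linear ramp, every month of year one sits 6 below its climatology and every month of year two sits 6 above it, so the anomalies come out as ±6.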

Simple check where we look at the original and the new.

variables = ['LST', 'VOD', 'NDVI', 'SM']

for ivariable in variables:

    fig, ax = plt.subplots(nrows=3, figsize=(10, 10))

    # Before Climatology
    select_pixel(cali_cube_interp[ivariable], pixel).plot(ax=ax[0])
    ax[0].set_title('Original Time Series')

    # Climatology
    select_pixel(cali_mean[ivariable], pixel).plot(ax=ax[1])
    ax[1].set_title('Climatology')

    # After Climatology
    select_pixel(cali_anomalies[ivariable], pixel).plot(ax=ax[2])
    ax[2].set_title('After Climatology Removed')
    plt.tight_layout()
    plt.show()

6. EM-DAT Data

I extract the dates of the drought events for California from the EM-DAT records. This will allow me to separate the drought years from the non-drought years.

cali_droughts = get_cali_emdata()
cali_droughts
ID ISO LOCATION Date_Start Date_End UTC_Start UTC_End geometry
105 106 USA California 01-Jun-2012 31-Dec-2012 735021 735234 (POLYGON ((-117.2328491210938 32.7764053344726...
132 133 USA California 01-Jan-2014 31-Dec-2014 735600 735964 (POLYGON ((-117.2328491210938 32.7764053344726...
153 154 USA California 01-Jan-2015 31-Dec-2015 735965 736329 (POLYGON ((-117.2328491210938 32.7764053344726...

So the drought years are:

Drought Years

  • 2012
  • 2014
  • 2015

Non-Drought Years

  • 2010
  • 2011
  • 2013

Note: Even though EM-DAT lists the 2012 drought as covering only half the year (June-December), we're going to treat 2012 as a full drought year.

# Drought Years
cali_anomalies_drought = get_drought_years(
    cali_anomalies, ['2012', '2014', '2015']
)

# Non-Drought Years
cali_anomalies_nondrought = get_drought_years(
    cali_anomalies, ['2010', '2011', '2013']
)
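For reference, the year split can be reproduced directly with xarray's datetime accessors (a sketch of what `get_drought_years` presumably does; the helper may differ):

```python
import numpy as np
import pandas as pd
import xarray as xr

# toy time axis matching the cube: 146 steps of 14 days from June 2010
time = pd.date_range('2010-06-01', periods=146, freq='14D')
da = xr.DataArray(np.zeros(146), dims='time', coords={'time': time})

# keep only time stamps whose year is a drought year
mask = da['time.year'].isin([2012, 2014, 2015])
drought = da.where(mask, drop=True)
print(sorted(set(drought['time.year'].values)))  # [2012, 2014, 2015]
```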

7. Extract Density Cubes

In this step, we will construct 'density cubes'. These are cubes where we add features from a combination of the spatial and/or temporal dimensions, so that instead of a single value, each sample carries spatial and/or temporal context. In this experiment, we will only look at temporal information. Our temporal resolution is 14 days and we want to look at a maximum of 6 months.

So:

\Bigg\lfloor \frac{6 \: months}{\frac{14\: days}{30 \: days} \:\times 1 \: month} \Bigg\rfloor = 12 \: time \: stamps
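The arithmetic can be checked directly:

```python
import math

days_per_month = 30   # approximate month length
step_days = 14        # temporal resolution of the cube
months = 6

# number of 14-day steps that fit in 6 months, rounded down
n_steps = math.floor(months / (step_days / days_per_month))
print(n_steps)  # 12
```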

# confirm
sub_ = cali_anomalies_drought.isel(time=slice(0,12))

sub_.time[0].data, sub_.time[-1].data
(array('2012-01-10T00:00:00.000000000', dtype='datetime64[ns]'),
 array('2012-06-12T00:00:00.000000000', dtype='datetime64[ns]'))

We see that for 2012 the first time stamp is 10 January and the last is 12 June, which is good enough: each density cube carries roughly 6 months of temporal information.
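For a single pixel, the temporal density cube is essentially a lag matrix. It can be sketched with NumPy's `sliding_window_view` (illustrative only; `get_density_cubes` may be implemented differently):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# toy single-pixel time series with 20 time stamps
series = np.arange(20, dtype=float)

# each row is one sample carrying 12 consecutive time stamps
density_cube = sliding_window_view(series, window_shape=12)
print(density_cube.shape)  # (9, 12)
```

A series of length T yields T - 12 + 1 samples, each with 12 temporal features, which matches the (n_samples, 12) shapes seen below.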

7.1 - Example Density Cube

# window sizes
spatial = 1
time = 12

vod_df, lst_df, ndvi_df, sm_df = get_density_cubes(cali_anomalies_drought, spatial, time)
vod_df.shape, lst_df.shape, ndvi_df.shape, sm_df.shape
((76690, 12), (85995, 12), (82729, 12), (76690, 12))

8. Find Common Elements

Notice how the number of samples differs between the datasets: the VOD and SM density cubes have fewer elements than LST and NDVI. To make a fair comparison, I'll use only the samples common to the density cubes being compared.

Note: It is also somewhat difficult for RBIG to calculate the mutual information for datasets that are potentially so different in their domain.
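The idea behind `get_common_elements` can be sketched with a pandas index intersection (an assumption about the helper's behavior; illustrative only):

```python
import numpy as np
import pandas as pd

# toy density cubes with partially overlapping sample indices
vod = pd.DataFrame(np.ones((4, 3)), index=[0, 1, 2, 3])
lst = pd.DataFrame(np.ones((5, 3)), index=[1, 2, 3, 4, 5])

# keep only the samples present in both frames
common = vod.index.intersection(lst.index)
vod_c, lst_c = vod.loc[common], lst.loc[common]
print(vod_c.shape, lst_c.shape)  # (3, 3) (3, 3)
```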

vod_df, lst_df = get_common_elements(vod_df, lst_df)
vod_df.shape, lst_df.shape
((76690, 12), (76690, 12))