Visually Comparing Climate Models¶

Summary¶

In this notebook, I will be comparing three climate reanalysis models:

NCEP-DOE Reanalysis 2: Surface
ERA5
CMIP5

I will be looking at the following variables:

Mean Sea Level Pressure (CMIP5, ERA5, NCEP)
Surface Pressure (ERA5, NCEP)

I will be trying to user RBIG in order to assess how similar these models are. I'll be looking at the following IT measures. If these climate models are that similar, then they should exhibit similar IT measures.

Preprocessing Steps¶

Regridded Spatially

The ERA5 had the coarsest spatial resolution (2.5 x 2.5).
I regridded the NCEP from (0.25 x 0.25) to (2.5 x 2.5).
I regridded the CMIP5 from (2 x 2.5) to (2.5 x 2.5).

Temporal Resolution

ERA5 and NCEP go from 1980-2019
CMIP5 goes from 2006-2018
For comparing ERA5 vs CMIP5 and NCEP vs CMIPF, I found the same time components

Measures¶

I'm measuring the following:

Entropy - expected uncertainty
Total Correlation - amount of redundant information between features
Mutual Information - amount of information shared between variables

Data¶

Inputs

I'm taking each year as is. Each spatial location is a sample and each year is a feature. My inputs are:

$X \in \mathbb{R}^{\text{spatial } \times \text{ month}}$

Outputs

All my information theory measures are in nats. They are scalars.

Hypothesis¶

Simple: The ERA5 and the NCEP model should be more similar than the CMIP5 model compared to each of them.

Data - Climate Models¶

import os, sys
cwd = os.getcwd()
source_path = f"{cwd}/../../../"
sys.path.insert(0, f'{source_path}')


# ESDC tools
sys.path.insert(0, f'/home/emmanuel/code/py_esdc')
# from esdc.preprocessing import normalize_temporal

import cdsapi
from zipfile import ZipFile
import pandas as pd
import xarray as xr
from tqdm import tqdm
from sklearn import preprocessing

# Visualization Tools
from src.data.climate.loader import ResultsLoader
from src.visualization.climate import PlotResults

import seaborn as sns
import matplotlib.pyplot as plt
# plt.style.use('ggplot')
plt.style.use(['seaborn-notebook', 'fivethirtyeight'])

%matplotlib inline

%load_ext autoreload
%autoreload 2

data_path = f"/home/emmanuel/projects/2020_rbig_rs/data/climate/raw/"
results_path = f"/home/emmanuel/projects/2020_rbig_rs/data/climate/results/amip/"
fig_path = f"/home/emmanuel/projects/2020_rbig_rs/reports/figures/climate/"

Results¶

CMIP5 vs ERA5 vs NCEP¶

!ls "/home/emmanuel/projects/2020_rbig_rs/data/climate/results/amip/"

compare_trial_v1.csv         spatial_compare_v2.csv
global_compare_v2.csv         spatial_compare_v3.csv
global_compare_v3.csv         spatial_compare_v4.csv
global_individual_v2.csv      spatial_individual_trial_v1.csv
global_individual_v3.csv      spatial_individual_v2.csv
global_individual_v4.csv      spatial_individual_v3.csv
individual_trial_v1.csv       spatial_individual_v4.csv
spatial_compare_trial_v1.csv

results = pd.read_csv(results_path + 'spatial_individual_v4' + '.csv', index_col=[0])
columns = results.columns.tolist()
results.tail()

	base	base_time	cmip	cmip_time	h_base	h_cmip	spatial	subsample	t_base	t_cmip	tc_base	tc_cmip	trial	variable
20349	ncep	1999-07-01	inmcm4	1999-06-16 00:00:00	-77.516609	-60.457848	6.0	5000.0	21.345011	29.123130	120.900891	107.421496	6.0	psl
20350	ncep	1999-08-01	inmcm4	1999-07-16 12:00:00	-79.558294	-60.430310	6.0	5000.0	23.388674	31.234617	119.664003	108.125878	6.0	psl
20351	ncep	1999-09-01	inmcm4	1999-08-16 12:00:00	-77.975434	-62.943946	6.0	5000.0	20.472294	29.212621	117.276838	110.032793	6.0	psl
20352	ncep	1999-10-01	inmcm4	1999-09-16 00:00:00	-86.973429	-62.825051	6.0	5000.0	23.205492	39.465716	117.996714	107.935966	6.0	psl
20353	ncep	1999-11-01	inmcm4	1999-10-16 12:00:00	-82.277103	-66.988666	6.0	5000.0	26.059670	38.457328	120.735649	108.947965	6.0	psl

# results['dates'] = pd.to_datetime(results.base_time)
# results = results.groupby(results.dates.dt.year).mean().drop_duplicates()
# results.head()

def plot_spatial_entropy(results_df: pd.DataFrame, base: str, cmip: str, spatial: int, normalize=True)-> None:

    # choose spatial window size
    results_df = results_df[results['spatial'] == spatial]
    results_df = results_df[results_df['base'] == base]
    results_df = results_df[results_df['cmip'] == cmip]

    results_df['dates'] = pd.to_datetime(results_df.base_time)
    results_df = results_df.groupby(results_df.dates.dt.year).transform('mean').drop_duplicates()

    # normalize
    if normalize:
        results_df['h_base'] = results_df['h_base'] / results_df['spatial']
        results_df['h_cmip'] = results_df['h_cmip'] / results_df['spatial']
    sns.scatterplot(data=results_df, x='dates', y='h_base', label=f"{base}")
    sns.scatterplot(data=results_df, x='dates', y='h_cmip', label=f"{cmip}")
    plt.title(f'{base.upper()}, CMIP: {cmip.upper()}, Spatial: {int(spatial)}')
    plt.xlabel('Years')
    plt.ylabel('Entropy, H')
    plt.legend()
    plt.show()

Example¶

# plot_spatial_entropy(results, 'era5', 'inmcm4')
plot_spatial_entropy(results, 'ncep', 'inmcm4', 1.0)
plot_spatial_entropy(results, 'ncep', 'inmcm4', 2.0)
plot_spatial_entropy(results, 'ncep', 'inmcm4', 3.0)
plot_spatial_entropy(results, 'ncep', 'inmcm4', 4.0)
plot_spatial_entropy(results, 'ncep', 'inmcm4', 5.0)
plot_spatial_entropy(results, 'ncep', 'inmcm4', 6.0)
plot_spatial_entropy(results, 'ncep', 'inmcm4', 7.0)
plot_spatial_entropy(results, 'ncep', 'inmcm4', 8.0)
plot_spatial_entropy(results, 'ncep', 'inmcm4', 9.0)
plot_spatial_entropy(results, 'ncep', 'inmcm4', 10.0)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-2a6fd108413e> in <module>
      6 plot_spatial_entropy(results, 'ncep', 'inmcm4', 5.0)
      7 plot_spatial_entropy(results, 'ncep', 'inmcm4', 6.0)
----> 8 plot_spatial_entropy(results, 'ncep', 'inmcm4', 7.0)
      9 plot_spatial_entropy(results, 'ncep', 'inmcm4', 8.0)
     10 plot_spatial_entropy(results, 'ncep', 'inmcm4', 9.0)

<ipython-input-7-6a32b1ebac42> in plot_spatial_entropy(results_df, base, cmip, spatial, normalize)
      7 
      8     results_df['dates'] = pd.to_datetime(results_df.base_time)
----> 9     results_df = results_df.groupby(results_df.dates.dt.year).transform('mean').drop_duplicates()
     10 
     11     # normalize

~/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/pandas/core/groupby/generic.py in transform(self, func, *args, **kwargs)
    592         # nuiscance columns
    593         if not result.columns.equals(obj.columns):
--> 594             return self._transform_general(func, *args, **kwargs)
    595 
    596         return self._transform_fast(result, obj, func)

~/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
    564         concat_index = obj.columns if self.axis == 0 else obj.index
    565         other_axis = 1 if self.axis == 0 else 0  # switches between 0 & 1
--> 566         concatenated = concat(applied, axis=self.axis, verify_integrity=False)
    567         concatenated = concatenated.reindex(concat_index, axis=other_axis, copy=False)
    568         return self._set_result_index_ordered(concatenated)

~/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    253         verify_integrity=verify_integrity,
    254         copy=copy,
--> 255         sort=sort,
    256     )
    257 

~/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    302 
    303         if len(objs) == 0:
--> 304             raise ValueError("No objects to concatenate")
    305 
    306         if keys is None:

ValueError: No objects to concatenate

With Trials¶

results = pd.read_csv(results_path + 'spatial_individual_v2' + '.csv', index_col=[0])
columns = results.columns.tolist()
results.tail()

	base	base_time	cmip	cmip_time	h_base	h_cmip	spatial	subsample	t_base	t_cmip	tc_base	tc_cmip	variable
3604	ncep	1980-04-01	inmcm4	1980-03-16 12:00:00	-305.487809	-306.000709	2.0	100000.0	3131.432539	3195.500214	310.320863	310.573945	psl
3605	ncep	1980-05-01	inmcm4	1980-04-16 00:00:00	-305.418017	-306.468608	2.0	100000.0	3121.518930	3184.075421	310.309945	310.960205	psl
3606	ncep	1980-06-01	inmcm4	1980-05-16 12:00:00	-306.280263	-305.456243	2.0	100000.0	3131.040802	3193.071733	311.079259	309.870990	psl
3607	ncep	1980-07-01	inmcm4	1980-06-16 00:00:00	-304.751022	-306.033549	2.0	100000.0	3123.749256	3183.194890	309.773832	310.637945	psl
3608	ncep	1980-08-01	inmcm4	1980-07-16 12:00:00	-305.726799	-306.364675	2.0	100000.0	3131.382549	3194.702803	310.502920	311.400503	psl

# plot_spatial_entropy(results, 'era5', 'inmcm4')
plot_spatial_entropy(results, 'ncep', 'inmcm4', 1.0)
plot_spatial_entropy(results, 'ncep', 'inmcm4', 2.0)

plot_global_entropy(results, 'era5', 'access1_0')
plot_global_entropy(results, 'ncep', 'access1_0')

plot_global_entropy(results, 'era5', 'bcc_csm1_1')
plot_global_entropy(results, 'ncep', 'bcc_csm1_1')

Difference in Entropy¶

import numpy as np

def plot_global_diff_entropy(results_df: pd.DataFrame, base: str, normalized=True)-> None:


    results_copy = results_df[results_df['base'] == base]

    # calculate difference
    results_copy['h_diff'] = np.abs(results_copy['h_cmip'] - results_copy['h_base'])

    if normalized == True:
        results_copy['h_diff'] = results_copy['h_diff'] / results_copy['spatial']


    fig, ax = plt.subplots()
#     sns.scatterplot(ax=ax, data=results_copy, x='spatial', y='h_diff', hue='base', color='black')
    sns.scatterplot(ax=ax, data=results_copy, x='spatial', y='h_diff', hue='cmip')

    plt.title(f'')
    plt.xlabel('Spatial Features')
    plt.ylabel('Entropy, H')
    plt.legend(ncol=2, bbox_to_anchor=(2.05, 1), fontsize=16)
#     plt.tight_layout()
    plt.show()

plot_global_diff_entropy(results, 'era5')

/home/emmanuel/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/ipykernel_launcher.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':
/home/emmanuel/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/ipykernel_launcher.py:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':

Total Correlation¶

def plot_global_tc(results_df: pd.DataFrame, base: str, cmip: str)-> None:
    results_df = results_df[results_df['base'] == base]
    results_df = results_df[results_df['cmip'] == cmip]
    sns.scatterplot(data=results_df, x='spatial', y='tc_base', label=f"{base}")
    sns.scatterplot(data=results_df, x='spatial', y='tc_cmip', label=f"{cmip}")
    plt.title(f'{base.upper()}, CMIP: {cmip.upper()}')
    plt.xlabel('Spatial Features')
    plt.ylabel('Total Correlation, TC')
    plt.legend()
    plt.show()

plot_global_tc(results, 'era5', 'bcc_csm1_1')
plot_global_tc(results, 'ncep', 'bcc_csm1_1')

def plot_global_diff_tc(results_df: pd.DataFrame, base: str, normalized=True)-> None:


    results_copy = results_df[results_df['base'] == base]

    # calculate difference
    results_copy['tc_diff'] = np.abs(results_copy['tc_cmip'] - results_copy['tc_base'])

    if normalized == True:
        results_copy['tc_diff'] = results_copy['tc_diff'] / results_copy['spatial']


    fig, ax = plt.subplots()
#     sns.scatterplot(ax=ax, data=results_copy, x='spatial', y='h_diff', hue='base', color='black')
    sns.scatterplot(ax=ax, data=results_copy, x='spatial', y='tc_diff', hue='cmip')

    plt.title(f'')
    plt.xlabel('Spatial Features')
    plt.ylabel('Total Correlation, TC')
    plt.legend(ncol=2, bbox_to_anchor=(2.05, 1), fontsize=16)
#     plt.tight_layout()
    plt.show()

plot_global_diff_tc(results, 'era5', normalized=True)
plot_global_diff_tc(results, 'ncep', normalized=True)

/home/emmanuel/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/ipykernel_launcher.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
/home/emmanuel/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/ipykernel_launcher.py:10: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # Remove the CWD from sys.path while we load stuff.

Mutual Information¶

results = pd.read_csv(results_path + 'compare_trial_v1' + '.csv', index_col=[0])
results.head()

	base	cmip	kendelltau	mi	pearson	spatial	spearman	subsample	time_mi	variable
0	ncep	inmcm4	0.552030	9.962479	0.833456	1.0	0.711300	10000.0	39.619016	psl
1	ncep	inmcm4	0.568794	1.489607	0.842241	2.0	0.730925	10000.0	10.354173	psl
2	ncep	inmcm4	0.581166	1.555495	0.844528	3.0	0.745527	10000.0	65.894330	psl
3	ncep	inmcm4	0.562111	1.463338	0.803643	4.0	0.715848	10000.0	96.675056	psl
4	ncep	inmcm4	0.529905	1.766060	0.751060	5.0	0.670227	10000.0	73.076941	psl

def plot_global_mi(results_df: pd.DataFrame, base: str, cmip: str, measure='mi', normalized=True)-> None:

    results_df = results_df[results_df['base'] == base]
    results_df = results_df[results_df['cmip'] == cmip]

    if normalized and measure == 'mi':
        results_df[measure] = results_df[measure] / results_df['spatial']
    sns.scatterplot(data=results_df, x='spatial', y=measure, label=f"{base}")
    sns.scatterplot(data=results_df, x='spatial', y=measure, label=f"{cmip}")
    plt.title(f'{base.upper()}, CMIP: {cmip.upper()}')
    plt.xlabel('Spatial Features')
    plt.ylabel('Mutual Information, MI')
    plt.legend()
    plt.show()

plot_global_mi(results, 'era5', 'inmcm4', 'mi')
plot_global_mi(results, 'era5', 'inmcm4', 'pearson')
plot_global_mi(results, 'era5', 'inmcm4', 'spearman')
plot_global_mi(results, 'era5', 'inmcm4', 'kendelltau')

plot_global_mi(results, 'era5', 'noresm1_m', 'mi')
plot_global_mi(results, 'era5', 'noresm1_m', 'pearson')
plot_global_mi(results, 'era5', 'noresm1_m', 'spearman')
plot_global_mi(results, 'era5', 'noresm1_m', 'kendelltau')

def plot_global_diff_mi(results_df: pd.DataFrame, base: str, measure='mi', normalized=True, log_mi=True)-> None:


    results_copy = results_df[results_df['base'] == base]


    if normalized == True:
        results_copy[measure] = results_copy[measure] / results_copy['spatial']

    if log_mi == True and measure == 'mi':
        results_copy[measure] = np.log(results_copy[measure])

    fig, ax = plt.subplots()
#     sns.scatterplot(ax=ax, data=results_copy, x='spatial', y='h_diff', hue='base', color='black')
    sns.scatterplot(ax=ax, data=results_copy, x='spatial', y=measure, hue='cmip')

    plt.title(f'')
    plt.xlabel('Spatial Features')
    plt.ylabel(f'{measure.upper()}')
    plt.legend(ncol=2, bbox_to_anchor=(2.05, 1), fontsize=16)
#     plt.tight_layout()
    plt.show()

plot_global_diff_mi(results, 'era5', 'mi', True)
plot_global_diff_mi(results, 'era5', 'pearson', True, False)
plot_global_diff_mi(results, 'era5', 'spearman', True, False)
plot_global_diff_mi(results, 'era5', 'kendelltau', True, False)

/home/emmanuel/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/home/emmanuel/.conda/envs/2019_rbig_ad/lib/python3.6/site-packages/ipykernel_launcher.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()