Earth Science Data Cubes
- Author: J. Emmanuel Johnson
- Source: Notebook I | Notebook II
The kind folks behind the Earth System Data Lab (Data Cube Service for Copernicus, DCS4COP) have been developing a package called xcube
which makes processing the Earth System Data Cube (ESDC) a lot easier. I tried using it a while back without any success, so I kind of gave up on it and wrote some of my own functions instead. But they recently announced on their forum that the package was ready, so I decided to test it out and see how it works.
My Background
I've been working with the datacubes for quite some time, and I've learned a lot thanks to their efforts. I was even a part of their Early Adopters program, which allowed me to use their homegrown JupyterLab system (sign up for access here). I learned a lot about xarray data structures in general, and it really motivated me to try to incorporate different spatial-temporal data representations in my research and thesis. What was super nice was that they had all of the data readily available for use, and all I had to do was tinker with my algorithms on their server. But I still wanted some stuff on my home machine, my lab machine, and possibly on Google Colab. Nothing heavy, just some light prototyping. So this is nice for those of us who don't necessarily want to use their system but would still like to play with the data. They went the extra mile and even made this package called xcube. Now, for my purposes, I like it because it handles a lot of masking and clipping. But that's just the tip of the iceberg...
#@title Install Appropriate packages
# requirements
!pip install xarray zarr shapely affine rasterio geopandas
# Cartopy
!apt-get install libproj-dev proj-data proj-bin
!apt-get install libgeos-dev
!pip install --upgrade -e "git+https://github.com/SciTools/Cartopy.git#egg=cartopy"
# xcube package from source
!pip install --upgrade "git+https://github.com/dcs4cop/xcube.git#egg=xcube" --pre
#@title Core Packages
import shapely
import geopandas as gpd
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
# core xcube packages
from xcube.core.dsio import open_cube
from xcube.core.geom import (
    clip_dataset_by_geometry,
    mask_dataset_by_geometry,
    rasterize_features
)
# plotting packages
import matplotlib.pyplot as plt
import cartopy
import cartopy.crs as ccrs
# load cube from the remote object storage bucket
cube_from_s3_bucket = open_cube("https://obs.eu-de.otc.t-systems.com/obs-esdc-v2.0.0/esdc-8d-0.25deg-1x720x1440-2.0.0.zarr")
cube_from_s3_bucket
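The cube comes back as a regular xarray.Dataset, so the usual xarray introspection applies. As a quick sanity check (nothing xcube-specific here), we can list what's inside:
# peek at the available data variables and dimensions (standard xarray API)
print(list(cube_from_s3_bucket.data_vars))
print(cube_from_s3_bucket.dims)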
1. Visualizing the Datacube
There have been some fantastic upgrades to the xarray
package. For example, visualizing the datacube is great now: datasets render as an HTML view with drop-down boxes, added in a recent [release](http://xarray.pydata.org/en/stable/whats-new.html#new-features).
#@title Visualize a Variable
cube_from_s3_bucket[['land_surface_temperature', 'root']]
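If the fancy repr doesn't show up on your setup, you can opt in explicitly. A minimal sketch, assuming a recent enough xarray version (the option was added around 0.15):
import xarray as xr

# opt in to the collapsible HTML repr (the default in recent versions)
xr.set_options(display_style="html")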
2. Visualizing Maps
We can use the plotting functions built into xarray. There are more advanced ways, but that's for another time.
fig, ax = plt.subplots(figsize=(10,7))
gpp_data = cube_from_s3_bucket.gross_primary_productivity.sel(time=slice('June-2010', 'June-2010'))
gpp_data.mean(dim='time', skipna=True).plot.pcolormesh(
ax=ax, cmap='viridis', robust=True,
)
ax.set_title('Gross Primary Productivity')
# ax.coastlines()
plt.show()
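The ax.coastlines() call is commented out above because it only works on a cartopy GeoAxes, not a plain matplotlib axis. Here's a minimal sketch of the same map with coastlines, assuming a PlateCarree projection matches the cube's regular lat/lon grid:
# same map on a cartopy GeoAxes so that coastlines work
fig, ax = plt.subplots(figsize=(10, 7), subplot_kw={'projection': ccrs.PlateCarree()})

gpp_data.mean(dim='time', skipna=True).plot.pcolormesh(
    ax=ax, transform=ccrs.PlateCarree(), cmap='viridis', robust=True,
)
ax.coastlines()
ax.set_title('Gross Primary Productivity')
plt.show()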
3. Making Subsets
Oftentimes we don't really want to work with the entire globe: we want to take a subset. I have been working with the western part of Eurasia and a bit of northern Africa. One way to do it is naively, by simply slicing the dataset at the locations we want.
In this example, I'll use the following coordinates:
- Latitude: 35.5N - 71.5N
- Longitude: 18W - 60E
#@title Slicing
# subset using xarray slicing (lat is stored in descending order,
# so the slice runs from north to south)
europe = gpp_data.sel(lat=slice(71.5, 35.5), lon=slice(-18.0, 60.0))
fig, ax = plt.subplots(figsize=(10,7))
europe.mean(dim='time').plot.imshow(ax=ax, cmap='viridis', robust=False)
ax.set_title('Gross Primary Productivity')
plt.show()
x1 = -18.0 # lon1, degree
x2 = 60.0 # lon2, degree
y1 = 35.5 # lat1, degree
y2 = 71.5 # lat2, degree
coords = x1, y1, x2, y2
#@title xcube
# convert into shapely bounding box
bbox = shapely.geometry.box(*coords)
# subset the cube with the bounding box
europe = clip_dataset_by_geometry(gpp_data, bbox)
fig, ax = plt.subplots(figsize=(10,7))
europe.mean(dim='time').plot.imshow(ax=ax, cmap='viridis', robust=False)
ax.set_title('Gross Primary Productivity')
plt.show()
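Both routes should land on essentially the same subset. A quick sanity check (a sketch, assuming both cells above were run; boundary cells may differ by a row or column depending on how the bbox edges fall on the 0.25 degree grid):
# compare the grid extents of the slice-based and the clip-based subsets
europe_sliced = gpp_data.sel(lat=slice(71.5, 35.5), lon=slice(-18.0, 60.0))
print('sliced: ', europe_sliced.lat.size, 'lats x', europe_sliced.lon.size, 'lons')
print('clipped:', europe.lat.size, 'lats x', europe.lon.size, 'lons')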
4. Shape Files and Masks
That was my custom area. But what if someone gives you a specific shapefile that they want you to subset? Or perhaps you would like to subset an entire region, e.g. North America or Spain. Dealing with shapefiles is quite easy, but converting them into rasters isn't exactly trivial. We'll use geopandas
to deal with the shapefile.
# get shape file location in package
shp_file_loc = gpd.datasets.get_path('naturalearth_lowres')
# read shape file
shp_gdf = gpd.read_file(shp_file_loc)
shp_gdf.head()
For this, let's extract North America.
query = 'North America'
column = 'continent'
# subset shape files within NA
na_shp_clf = shp_gdf[shp_gdf[column] == query]
# collapse all country geometries into a single shape for the continent
na_shp_clf = na_shp_clf.dissolve(by='continent')
na_shp_clf.head()
na_shp_clf.plot()
Now we can use mask_dataset_by_geometry
to set all values outside of North America to NaNs.
masked_gpp = mask_dataset_by_geometry(
cube_from_s3_bucket[['gross_primary_productivity']].sel(time=slice('June-2010', 'June-2010')),
na_shp_clf.geometry.values[0]
)
masked_gpp
fig, ax = plt.subplots(figsize=(10,7))
masked_gpp.gross_primary_productivity.mean(dim='time').plot.imshow(ax=ax, cmap='viridis', robust=False)
ax.set_title('Gross Primary Productivity')
plt.show()
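Since everything outside North America is now NaN, regional statistics come almost for free: a spatial mean with skipna=True yields a time series for the region alone. A small sketch (unweighted by grid-cell area, which a proper analysis would want to account for):
# unweighted mean GPP over North America per time step;
# pixels outside the geometry are NaN, so skipna drops them
na_gpp_ts = masked_gpp.gross_primary_productivity.mean(dim=['lat', 'lon'], skipna=True)

na_gpp_ts.plot(marker='o')
plt.title('Mean GPP, North America (unweighted)')
plt.show()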
query = 'Spain'
column = 'name'
spain_shp_df = shp_gdf[shp_gdf[column] == query]
# collapse the geometries into a single shape
spain_shp_df = spain_shp_df.dissolve(by='continent')
# plot
spain_shp_df.plot(); spain_shp_df.head()
#@title Masking
masked_gpp = mask_dataset_by_geometry(
cube_from_s3_bucket[['gross_primary_productivity']].sel(time=slice('June-2010', 'June-2010')),
spain_shp_df.geometry.values[0])
fig, ax = plt.subplots(figsize=(10,7))
masked_gpp.gross_primary_productivity.mean(dim='time').plot.imshow(ax=ax, cmap='viridis', robust=False)
ax.set_title('Gross Primary Productivity')
plt.show(); masked_gpp
Clipping is different from masking: clip_dataset_by_geometry only crops the dataset to the geometry's bounding box, while mask_dataset_by_geometry sets everything outside the geometry itself to NaN. Compare the two plots.
#@title Clipping
clipped_gpp = clip_dataset_by_geometry(
    cube_from_s3_bucket[['gross_primary_productivity']].sel(time=slice('June-2010', 'June-2010')),
    spain_shp_df.geometry.values[0])
fig, ax = plt.subplots(figsize=(10,7))
clipped_gpp.gross_primary_productivity.mean(dim='time').plot.imshow(ax=ax, cmap='viridis', robust=False)
ax.set_title('Gross Primary Productivity')
plt.show(); clipped_gpp
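To make the difference concrete, a quick check (a sketch, assuming both the masking and clipping cells above were run): the clipped cube keeps non-Spain pixels that fall inside the bounding box, while the masked cube turns them into NaNs.
# count NaNs in each result: the clipped cube keeps pixels outside Spain
# but inside its bounding box, so it should contain fewer NaNs
print('masked NaNs: ', int(masked_gpp.gross_primary_productivity.isnull().sum()))
print('clipped NaNs:', int(clipped_gpp.gross_primary_productivity.isnull().sum()))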
What Now?
There are many more things to explore. I've just scratched the surface, because for my needs this is already perfect; I'm more focused on other things. But I would like to point out some other nuggets that they have:
More Information
They have a link that shows all of the cubes they currently have. There are a few with different spatial resolutions, and some configurations are optimized more for spatial computations than for temporal ones. The latest complete public list of all variables can be found here. And more information about the organization DCS4COP can be found here.
Viewer
They have their own viewer that you can run locally, or you can view the cubes via their server.
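I haven't explored the viewer much yet, but the local route goes through the xcube CLI. A minimal sketch, with the caveat that the exact options may differ between xcube versions and the cube path below is just a placeholder:
# check the available options first; the serve subcommand ships with xcube
!xcube serve --help
# then something along the lines of (placeholder path):
# !xcube serve path/to/local-cube.zarr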