DATASETS
In general, how easy a dataset is to build depends on the problem.
Easy - Spatiotemporally Independent
- Variable Transformations
Spatially Dependent, Temporally Independent
- Segmentation, Classification, Regression
- Rasterio, TorchGeo, Raster-Vision
Hard - Spatiotemporal Data
- xarray, zarr, rioxarray
- Weather Prediction
- XRPatcher
Core Operations
- Input
- Split
- Apply
- Combine
Concepts
- DataCube
- Time Series
- Scene, Image
- Scene —> ROI
- Image, Tile, Patch
- Area of Interest (AOI), Region
- Resolution, Frequency
- Bounding Box, Period
- Patch, Chip, Cube
Dataset Objects
- Dataset
- Sampler
- DataLoader
- DataModule
EO Data Problems¶
Image File Size. Often the image file sizes are way too big to fit into memory, let alone do transformations. To combat this, we often do patching, whereby we take subsets of the image. However, we need to be careful because it is also inefficient to keep loading the entire image into memory just to take a subset. Furthermore, this becomes even more expensive when we consider multiple large images.
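For example, a minimal sketch of reading a single patch with a rasterio windowed read, so the full image never has to be loaded (the file name and window offsets are hypothetical):
import rasterio
from rasterio.windows import Window

# read a 256x256 patch (all bands) without loading the whole scene
with rasterio.open("scene.tif") as src:
    window = Window(col_off=2048, row_off=1024, width=256, height=256)
    patch = src.read(window=window)  # shape: (bands, 256, 256)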
Multiple Files for 1 Image. We often get a single scene that is split across files. These splits could be partitions over time, space, or spectral band. The split is satellite dependent, e.g., GOES is a geostationary satellite whose scenes are split across bands, whereas MODIS is a polar-orbiting instrument whose scenes are split across space (tiles). So when we load the data, we need to combine the files correctly. If possible, one should try to harmonize all datasets to a single format, but sometimes this is infeasible due to computational resources.
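As a sketch of how the combination can look with xarray (the file names are hypothetical), both kinds of split can be merged at load time:
import xarray as xr

# spatial split: tiles of the same scene in separate files
tile_files = ["scene_h10v05.nc", "scene_h11v05.nc"]
scene = xr.open_mfdataset(tile_files, combine="by_coords")

# spectral split: one file per band, concatenated along a new band dimension
band_files = ["scene_B01.nc", "scene_B02.nc"]
scene = xr.open_mfdataset(band_files, combine="nested", concat_dim="band")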
Heterogeneous, Multi-Modal Data. We often have multi-modal data which we wish to merge. For example, we may have two satellites that share some information in space or spectral channels. We need a way to combine these datasets, either as a union or as an intersection. The most frustrating thing is the heterogeneity across different datasets; regarding the file split, for example, we outlined the GOES and MODIS case above. In general, it is always good to have some minimum amount of data homogeneity, i.e., all images are split in the same way and contain the same dimensions. However, there are cases where this is impossible, for example when we are limited by computational resources or memory/storage.
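TorchGeo exposes this union/intersection idea directly on its GeoDatasets; a sketch with two of its built-in datasets (the directories are hypothetical and assume the data is already on disk; depending on the TorchGeo version the first argument is root or paths):
from torchgeo.datasets import CDL, Sentinel2

sentinel = Sentinel2("data/sentinel2/")
cdl = CDL("data/cdl/")

intersection = sentinel & cdl  # sample only where both datasets overlap
union = sentinel | cdl         # sample wherever either dataset has coverage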
Heterogeneous Data Types. Another problem is the data types themselves, for example rasters versus polygons. In general, ML works best with discrete, regular data structures. However, recently there has been a lot of new work with GNNs, which are useful for irregular domains.
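A common workaround is to rasterize the vector data onto the raster's grid; a sketch with geopandas and rasterio (the file names are hypothetical):
import geopandas as gpd
import rasterio
from rasterio import features

with rasterio.open("scene.tif") as src:
    polygons = gpd.read_file("labels.geojson").to_crs(src.crs)
    # burn the polygons into a (H, W) mask aligned with the scene
    mask = features.rasterize(
        ((geom, 1) for geom in polygons.geometry),
        out_shape=(src.height, src.width),
        transform=src.transform,
        fill=0,
    )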
EO Meta-Data¶
The meta-data is the saving grace for EO data. This is because we have access to information that allows us to connect all EO datasets together.
CRS. We have a common coordinate reference system which should be present in every georeferenced dataset. This gives every value of the field a context or reference.
Coordinate Transformations. We often have datasets with different CRSs. However, we can easily project our data into any other CRS, i.e., a coordinate transformation. Fortunately for us, these are very simple transformations, so they are not expensive operations to do beforehand or on-the-fly.
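A sketch of such a reprojection with rioxarray (the file name is hypothetical):
import rioxarray

da = rioxarray.open_rasterio("scene.tif")  # the CRS is read from the file
da_wgs84 = da.rio.reproject("EPSG:4326")   # project to lat/lon (WGS84)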
Resolution. While each georeferenced dataset has an underlying CRS, they often have different resolutions. The CRS allows us to resample our data according to a different resolution. We simply need to project our data to a common CRS and then apply an interpolation algorithm to the common resolution.
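A sketch of resampling one dataset onto another's grid (common CRS and resolution) with rioxarray (the file names are hypothetical):
import rioxarray
from rasterio.enums import Resampling

coarse = rioxarray.open_rasterio("coarse_scene.tif")
fine = rioxarray.open_rasterio("fine_scene.tif")

# reproject and interpolate the coarse data onto the fine grid
coarse_on_fine = coarse.rio.reproject_match(fine, resampling=Resampling.bilinear)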
Patching. The CRS also allows us to patch according to coordinates, not the absolute pixel locations within the image. It may not be necessary to use the CRS for training a simple ML model. However, it becomes extremely useful when we deal with multi-modal inputs, where we can have unions and intersections between datasets. In addition, inference requires us to combine the patches back together in a meaningful way, so the CRS becomes very useful in the combination process.
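A sketch of selecting a patch by coordinates (a bounding box) rather than by pixel indices, using xarray label-based slicing (the file and coordinate names are hypothetical):
import xarray as xr

ds = xr.open_dataset("scene.nc")

# bounding box in the dataset's own coordinates (note: latitude is often stored descending)
patch = ds.sel(lon=slice(-10.0, -9.5), lat=slice(45.0, 44.5))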
Examples¶
The examples are split according to which data structure we choose to save our data.
GeoTIFF
- TorchGeo
- Scene - RasterDataset, RandomGeoSampler, Training
- Tiles - RasterDataset, RandomBatchGeoSampler, Training
- Pre-Patched - RasterDataset, PreChippedGeoSampler, Training
- RasterDataset, GridGeoSampler, Inference
- RasterVision
- Simple
- Separate Bands in Different Files
- Separate Tiles
NetCDF/ZARR
- XRPatcher - TorchData
- Rastervision - XArraySource
Pre-Patched Files (Numpy)
- mlx-data - functional
- TorchGeo - NonGeoDataset
- RasterVision - ImageSource
Polygons
- TorchGeo
- RasterVision
CSV Files
- Custom Dataloader
- Merlin DataLoader
MLX-Data¶
This is a neat little library that is purely functional and cross-platform. I think it's a great way to create dataloaders while doing a lot of preprocessing on the fly (see the sketch at the end of this subsection).
- Domain Files + Domain-Processing Chain + Normalize + Patching
- Domain-Processed Files + Normalize + Patching
- Pre-Patched Files + Normalize
Use Cases
- Helio-Physics - FITS Files
- Remote Sensing - GeoTIFF
- Geoscience - NetCDF
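A minimal sketch of the third pattern above (pre-patched files + normalize), assuming mlx-data's buffer_from_vector / key_transform / batch API; the file list and normalization are hypothetical:
import numpy as np
import mlx.data as dx

files = ["patch_000.npy", "patch_001.npy"]  # hypothetical pre-patched files

def normalize(x):
    return (x - x.mean()) / (x.std() + 1e-6)

dset = (
    dx.buffer_from_vector([{"patch": np.load(f)} for f in files])
    .shuffle()
    .to_stream()
    .key_transform("patch", normalize)
    .batch(4)
)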
EO Datasets¶
When building an ML-compatible dataset, we have two choices: a geodataset or a nongeodataset. Essentially, we have to decide whether we want to build a custom dataset which accounts for the georeference meta-data, or a generic dataset which does not necessarily account for it.
GeoDataset¶
This is the new meta. We are essentially blurring the lines between geo-processing and ML-processing. We can now do some of these operations on the fly, like CRS projections, light resampling, and patching.
Advantage: We keep all of the meta-data which could be useful, like the static variables, e.g., coordinates, masks, CRS. It also makes experimenting with different aspects more flexible, e.g., patch size or different regions/AOIs. It also makes the engineering much easier for users: we only have to worry about some steps like data homogenization, and then the dataset will take care of the rest under the hood.
Disadvantage: We have to create our own custom datasets which take the meta-data into account. This requires more software engineering than many people are trained to do. However, this is changing as we see many new libraries coming up to handle this problem, e.g., TorchGeo, Raster-Vision, XRPatcher.
Personal Take I: I think this is akin to operator learning where we treat all data as a function instead of just discrete values.
Personal Take II: I think dealing with rasters is almost a solved problem. However, dealing with spatiotemporal datacubes is not… The main reason is that we don’t have a nice API for slicing subregions of the dataset without opening the full dataset or collisions.
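For the raster case, a sketch of this pattern with TorchGeo: a raster-backed GeoDataset, a geo-aware random sampler, and a DataLoader that carries the geo meta-data along. The class name, directory, and file pattern are hypothetical; depending on the TorchGeo version the first dataset argument is root or paths:
from torch.utils.data import DataLoader
from torchgeo.datasets import RasterDataset, stack_samples
from torchgeo.samplers import RandomGeoSampler

class SceneDataset(RasterDataset):
    filename_glob = "*.tif"  # hypothetical naming scheme
    is_image = True

dataset = SceneDataset("data/scenes/")
sampler = RandomGeoSampler(dataset, size=256, length=1000)  # patch size in CRS units
loader = DataLoader(dataset, sampler=sampler, batch_size=16, collate_fn=stack_samples)

batch = next(iter(loader))
images = batch["image"]  # the CRS and bounds stay in the sample dict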
NonGeoDataset¶
This was the way for many years. We essentially do all of our processing on the georeferenced data and save it to a bucket: CRS projections, resampling, normalization, pre-patching, and saving. Once it's in this form, we no longer have to think about the geo-stuff and can focus solely on the machine learning.
The advantage is that these are conceptually much easier to code and modify because the majority of the ML methods use this kind of data.
The disadvantage is that we lose a lot of information. We also have to make hard decisions about the geo-preprocessing which are difficult to change in the future.
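A sketch of this pattern: a plain PyTorch Dataset over pre-patched .npy files, with all of the geo-preprocessing already done offline (the directory layout and normalization statistics are hypothetical):
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import Dataset

class PatchDataset(Dataset):
    def __init__(self, patch_dir: str, mean: float = 0.0, std: float = 1.0):
        self.files = sorted(Path(patch_dir).glob("*.npy"))
        self.mean, self.std = mean, std

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int) -> torch.Tensor:
        patch = np.load(self.files[idx])        # pre-patched (C, H, W) array
        patch = (patch - self.mean) / self.std  # normalization decided offline
        return torch.from_numpy(patch).float()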
Pieces
- Parameters
- Operators
- Buckets
Params¶
- Region
- Spatial Resolution
- Temporal Period
- Temporal Frequency
Operators¶
- Download Data from Server
- Preprocessing Data Structure
- Subset - Region, Period
- Data Harmonization - Time, Space, Spectral Channels
- CRS - Reprojection
- Variable Transformations - Radiance/Reflectance, Velocity, FFT
- Resampling, Regridding
- Interpolation - Gap/NAN-Filling
- Normalization
- Patching - Space, Time, Spectral Channel
- ML Data Structure
Buckets¶
- Raw Data
- Analysis-Ready
- ML-Ready
- Results
Normalization-Patching-ML-Ready¶
This part is the most flexible part of the pipeline. It is basically a balance between storage, RAM, and processing power.
Option I: Pre-Chipping¶
In this case, we will pre-chip the images to obtain consistent chipped datasets. One advantage of this method is that we are free to choose the data structure we save to. This allows flexibility when people create their custom datasets, provided they are simple data structures like .tif, .png or numpy arrays. In addition, the user will not have to worry about making patches.
Part I - Get ML-Ready Data
- Load Analysis-Ready Data
- Initialize Normalizer
- Pre-Patching
- Save ML-Ready Data
- Save Normalizer
# select analysis-ready files
analysis_ready_files: List[str] = …
# load data
ds: Dataset = load_dataset(analysis_ready_files)
# calculate transformation parameters
transform_params: Dict = calculate_transform_params(ds, **params)
save_normalizer(…, transform_params)
# define patch parameters
patch_size: Dict = dict(lon=256, lat=256)
stride: Dict = dict(lon=64, lat=64)
# define patcher
patcher: Patcher = Patcher(patch_size, stride)
# save patches to ML Ready Bucket
file_path: Path = Path(…)
save_name_id: str = …
num_workers: int = …
save_patches(ds, patcher, num_workers, file_path, save_name_id)
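As a concrete (if simplified) version of this pre-patching step, a sketch with plain xarray slicing; the file name, dimension names, and output directory are hypothetical:
import itertools

import numpy as np
import xarray as xr

ds = xr.open_dataset("analysis_ready.nc")  # dims: (time, lat, lon)
patch, stride = 256, 64

lat_starts = range(0, ds.sizes["lat"] - patch + 1, stride)
lon_starts = range(0, ds.sizes["lon"] - patch + 1, stride)

for i, (ilat, ilon) in enumerate(itertools.product(lat_starts, lon_starts)):
    chip = ds.isel(lat=slice(ilat, ilat + patch), lon=slice(ilon, ilon + patch))
    np.save(f"ml_ready/patch_{i:05d}.npy", chip.to_array().values)  # (variable, time, 256, 256)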
Part II - Create ML Dataset
- Load ML-Ready Data
- Load Normalizer
- Apply Normalizer
- Create Dataset
# get ml ready data files
ml_ready_data_files: List[str] = […]
# load transform params, init transform
transform_params = load_transform_params(…)
transformer = init_transformer(transform_params)
# create dataset
ds = Dataset(ml_ready_data_files, transformer)
# demo item
num_samples: int = …
sample: Tensor["B C H W"] = ds.sample(num_samples)
Option II: Patching Data Module¶
In this case, we will create a dataset that does some preprocessing on-the-fly. We just need to save the scenes to a chosen data structure, and then we need a custom dataset which allows us to subset an AOI and take patches on the fly.
- Load Analysis-Ready Data
- Initialize Normalizer
- Apply Normalizer
- Patch On The Fly
# get analysis ready data files
analysis_ready_files: List[str] = […]
# load transform params, init transform
transform_params: Dict = …
transformer = init_transformer(transform_params)
# initialize patch parameters
patch_size = dict(lon=256, lat=256)
stride = dict(lon=64, lat=64)
# initialize dataset
ds: Dataset = Dataset(
    analysis_ready_files,
    transformer,
    patch_size,
    stride,
    **kwargs
)
# demo item
sample: Tensor["1 C 256 256"] = ds.sample(1)
Helio-Physics Examples¶
Minimal Data Harmonization¶
# download
raw_files = download(**params)
# filter files for anomalies
good_raw_files = list(filter(criteria, raw_files))
# open
data: Map = open(good_raw_files)
# validate data
data: Map = validate(data, **params)
# save to analysis ready bucket
analysis_file = save(data, **params)
Full ML Inference Loop¶
# open file
data: Map = open(analysis_ready_file)
# do helio-preprocessing - limb darkening, calibration
data: Map = helio_preprocess(data, **params)
# change data structure - Map —> NDArray
data: NDArray = change_ds(data, **params)
# apply the split operation - patching
data: NDArray = patcher(data, **params)
# machine learning pre-processing - normalize
data: MLTensor = ml_preprocess(data, **params)
# load machine learning model
model: Model = load_model(**params)
# apply machine learning model
out: MLTensor = model(data)
# invert the machine learning pre-processing - unnormalize
data: NDArray = ml_preprocess(out, inverse=True, **params)
# apply the combine operation - unpatching
data: NDArray = patcher(data, inverse=True, **params)
# change data structure
data: Map = change_ds(data, inverse=True, **params)
# apply helio post-processing
out: Map = helio_preprocess(data, inverse=True, **params)
ML-Ready Pipeline¶
Creating ML-Ready Data
# open file
data: Map = open(analysis_ready_file)
# do helio-preprocessing - limb darkening, calibration
data: Map = helio_preprocess(data, **params)
# change data structure - Map —> NDArray
data: NDArray = change_ds(data, **params)
# initialize normalizer
normalizer_params = initialize_normalizer(data, **params)
# save normalizer
normalizer_file = save_normalizer(normalizer_params, **params)
# apply the split operation - patching (OPTIONAL)
data: NDArray = patcher(data, **params)
# save to ML-ready bucket
ml_ready_file = save(data, **params)
ML Training Loop
# create dataset
ds: MLDataset = MLDataset(ml_ready_file, ml_preprocess, **params)
# create a sampler
sampler: Sampler = Sampler(ds, **params)
# create a DataLoader
dl: DataLoader = DataLoader(ds, sampler, **params)
# initialize model
model: Model = Model(**params)
# initialize trainer
model.compile(opt, loss, callbacks, **params)
# fit model
model.fit(dl)
# save model
model_hub = model.save(**params)
from dataclasses import asdict, dataclass

@dataclass
class HelioProcessing:
    limb_darkening: float = 0.1

    def __call__(self, data: Map) -> Map:
        # apply the forward processing, passing the dataclass fields as parameters
        data: Map = fn(data, **asdict(self))
        return data

    def post_processing(self, data: Map) -> Map:
        # undo the processing (placeholder)
        return data
Inference Pipeline
# make an inference step
def inference_step(data):
    data: NDArray = normalize(data)
    data = iti_model(data)
    data = unnormalize(data)
    return data
Get some samples
samples: List[str] = […]
Loop through the chain
for isample in dset:
    # apply helio-pipeline
    data = helio_pipeline(isample, **params)
    # create patches
    patches = patcher(data, **params)
    # apply inference step on patches
    patches = list(map(inference_step, patches))
    # unpatch patches
    data = patcher(patches, reverse=True, **params)
    # un-process helio-pipeline
    data = helio_postprocess(data, **params)
Create a Dataset Loader
dset = (
    dset.to_buffer(open(files))
    .key_transform("key", helio_pipeline)
    .key_transform("key", sampler)
    .batch(1)
)
Examples¶
Level I Domain Specific DS + Pre-Patching + NDArray
- Satellite Data X-Y - h5/tif/nc, rasterio/rioxarray/geotensor - sentinel, landsat
- Satellite Data Lat-Lon - h5/nc, xarray, npy - modis, goes
- Ocean Data Lat-Lon - nc, xarray, npy - glorys, hycom
- Heliophysics Data - fits, sunpy, npy/png - sdo
Level II Domain Specific + Pre-Patching + Domain Specific DS
- xrpatcher - netcdf
- torchgeo - tif