Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

v3 — Global pixel-level cloud climatology

v2.5’s per-AOI clear-fraction algorithm, batched over the whole globe and written into the shared Zarr.

This is the capstone: the same per-scene operator graph from v2.5 applied to every cell of the global 0.1° grid for every scene in v2’s catalog, producing the truthful per-pixel clear-observation statistics at planetary scale.

Goal

For every cell of the shared 0.1° global grid, each sensor, and each monthly time bin, compute the true pixel-level clear-observation statistics by reading each intersecting scene’s QA band, masking to clear pixels, and aggregating into the cell.

Question answered

“How often was the surface inside this exact cell actually visible to the sensor over [t0,t1][t_0, t_1]?”

The truthful version of v2’s question, answered globally. The v2-vs-v3 difference map is the headline product: it shows where scene-level cloud is a biased estimate of pixel-level observability — coastal, mountain, partly-cloudy regions.

Scope

Algorithm

The same per_scene operator graph from v2.5, run with the loop inverted to keep compute tractable:

1. For each (sensor, time_bin):
     items = v2_catalog.query(global_bbox, time_bin)
2. For each item (in parallel across workers):
     qa_band = georeader.read_full(item.assets[qa_band_key])   # one full SCL/QA read
     clear_pixel_mask = decoder(qa_band, clear_classes=cfg.clear_classes)
     # Reduce to grid cells immediately — don't accumulate full-res masks.
     cells_touched = footprint_to_cells(item.geometry, grid_resolution=0.1°)
     for cell in cells_touched:
         window = clear_pixel_mask.window(cell)                # ~10×10 pixels at 0.1°
         emit (cell, item.datetime, window.mean())             # one float per (cell, item)
3. For each cell, aggregate the per-item floats:
     clear_obs_count    = sum(frac > clear_threshold)
     clear_fraction     = mean(frac)
     pixel_max_gap_days = max(gap between frac > threshold)
4. Write into shared Zarr at (sensor, time, lat, lon).

Why loop-inversion is essential: the alternative (“for each cell, loop over items”) opens each scene N times (N = cells per scene ~ 100). Inversion opens each scene once. Same total work, 100× less I/O.

Architecture

projects/satellite_climatology/
└── src/satellite_climatology/
    ├── grid.py             # (shared)
    ├── sensors.py          # (shared)
    ├── qa_decoders/        # (shared with v2.5)
    ├── operators.py        # (shared with v2.5) — the same per_scene graph
    ├── catalog.py          # (shared with v2) — DuckDBGeoCatalog scan
    ├── aggregate.py        # (extends v2) — bin per-item clear floats into cells
    └── global_pipeline.py  # the v3 entry point — runs the inverted loop

The four-repo capstone in one diagram:

            +---- DuckDBGeoCatalog (geocatalog) ----+
            | one row per STAC item                  |
            | bbox + datetime + asset hrefs          |
            +---------------------------------------+
                          |
                          v
            +---- geopatcher.SpatialPatcher ---------+
            | iterates monthly batches of items      |
            | (parallelisation unit = N items)       |
            +---------------------------------------+
                          |
                          v
            +---- geotoolz Operator graph -----------+   <- IDENTICAL to v2.5
            | ResolveAsset                           |
            | | ReadQA(reader="georeader")  <-+      |
            | | DecodeQA(decoder=...)        |      |  georeader does the
            | | CellClearFrac (per-cell mean)|      |  windowed read
            +--------------------------------|------+
                          |                  |
                          v                  v
            +---- accumulator (xarray Zarr) --------+
            | bands: clear_obs_count, clear_frac,   |
            |        pixel_max_gap_days             |
            +---------------------------------------+

Output

Writes the last three bands of the shared Zarr:

clear_obs_count        (sensor, time, lat, lon) int16
clear_fraction         (sensor, time, lat, lon) float32
pixel_max_gap_days     (sensor, time, lat, lon) float32

After v3 the Zarr is complete and the dashboard shows the full ten-band set, including the v2-vs-v3 difference map.

Compute budget

Per S2 item (the worst case):

Sized down:

Memory: per-worker peak is one SCL band read (~150 MB live). Trivial.

Storage:

Sizing path

  1. M6.1 — single sensor (S2), continental AOI (CONUS), 0.5° grid, 1 year. Validates the pipeline end-to-end. Laptop-tractable.
  2. M6.2 — S2 only, CONUS, 0.1° grid, 1 year. Same scenes, finer binning.
  3. M6.3 — S2 only, global, 0.1° grid, 1 year. First cluster job. Decide on infra (Azure Batch, Dask cluster, etc.).
  4. M6.4 — Add Landsat 8/9.
  5. M6.5 — Add MODIS / VIIRS once their decoders ship.

Ship the dashboard with v2 + v2.5 at M5; v3 progressively fills in the global pixel-level bands as each milestone completes.

UI integration

The v3 Zarr is rendered as additional tile layers in the same dashboard from v1/v2. New dropdowns:

The v2.5 AOI tab is unchanged — it remains the user-facing “drill into a single AOI” tool, with v3 being its global archive.

Risks & open questions

Acceptance

Out of scope