v2 — Data-driven, scene-level cloud climatology

Of the scenes that were actually acquired and ingested into the Planetary Computer / Earth Search catalogs, how many cover each pixel and what was their scene-wide cloud cover?

Goal

For every pixel of the shared 0.1° global grid and each sensor, scan the real STAC catalog and aggregate per-cell statistics from the item metadata (footprint + eo:cloud_cover + datetime). No pixel reads — this stage stays in metadata-land.

Question answered

“Given the catalog as it actually exists, how many scenes covered this pixel in [t0, t1], and what fraction were nominally cloud-free at the scene level?”

The user-facing pitch is the “reality gap” map: subtracting v2’s scenes_count from v1’s overpasses shows where the catalog falls short of the theoretical maximum. This is useful for spotting processing-pipeline outages, off-nadir tasking effects, or regions where a sensor is simply not acquired (e.g., open ocean for the Landsat 8/9 WRS-2 land grid).

Scope

Algorithm

Two-pass design — neither pass queries per cell:

Pass 1 — catalog ingestion

For each sensor’s collection:

  1. Issue a paginated STAC search over the global bbox + time window with no spatial filter (or a coarse one to chunk).
  2. Stream items into a DuckDB-backed GeoCatalog (geocatalog’s DuckDBGeoCatalog — GeoParquet 1.1 with bbox-column predicate pushdown). Persists as data/catalogs/<sensor>.parquet.
  3. Item attributes kept: id, datetime, geometry (footprint), eo:cloud_cover, platform.
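Step 3 can be sketched in plain Python, assuming each streamed item arrives as a STAC Item dict (a GeoJSON Feature with its metadata under "properties"). The helper names slim_item and slim_items are hypothetical and are not part of geocatalog; the example item and its values are made up for illustration.

```python
from typing import Any, Iterable, Iterator

# Property names kept from each item, as listed in step 3.
KEPT_PROPERTIES = ("datetime", "eo:cloud_cover", "platform")


def slim_item(item: dict[str, Any]) -> dict[str, Any]:
    """Reduce one STAC item dict to the attributes the catalog keeps."""
    props = item.get("properties", {})
    row = {"id": item["id"], "geometry": item["geometry"]}
    for key in KEPT_PROPERTIES:
        row[key] = props.get(key)  # None when the item omits the field
    return row


def slim_items(items: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Lazily slim a stream of items so a page never has to be held whole."""
    for item in items:
        yield slim_item(item)


# Made-up example item; real items carry many more properties and assets.
example = {
    "type": "Feature",
    "id": "hypothetical-item-0001",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
    },
    "properties": {
        "datetime": "2024-01-01T10:00:00Z",
        "eo:cloud_cover": 12.5,
        "platform": "sentinel-2a",
        "some:other_field": "dropped",
    },
    "assets": {},
}
row = slim_item(example)
```

Dropping assets and unused properties at ingestion keeps each row small, which matters when the per-sensor catalog holds on the order of a million items.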

Why DuckDB and not InMemory: 1 year of S2 over the globe is ~1M items. DuckDB handles this on a laptop; an InMemory GeoDataFrame doesn’t.

Pass 2 — per-cell aggregation

The key insight is inverting the loop. Don’t iterate cells and query the catalog. Iterate items and stamp them into cells:

  1. For each item, compute the set of grid cells its footprint intersects (rasterise the footprint polygon onto the 0.1° grid).
  2. For each cell, push the item’s (datetime, eo:cloud_cover) into that cell’s accumulator.
  3. After all items: each cell’s accumulator yields:
    • scenes_count = len(items)
    • mean_scene_cloud_pct = mean(item.cloud_cover for item in items)
    • cloud_free_scene_count = sum(item.cloud_cover < 10 for item in items)
    • max_gap_days = max(diff(sorted(item.datetime for item in items)))
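The inverted loop above can be sketched with the standard library only. This is a minimal illustration under stated assumptions, not the project's aggregate.py: footprints are simple (lon, lat) rings with no holes and no antimeridian crossing, a cell counts as covered when its centre lies inside the footprint (a simplification of the "intersects" rule), and the function names are hypothetical.

```python
import math
from collections import defaultdict
from datetime import datetime

CELLS_PER_DEG = 10  # 0.1° grid


def point_in_ring(x, y, ring):
    """Ray-casting test: is (x, y) inside the closed ring of (lon, lat) pairs?"""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def cells_for_footprint(ring):
    """Yield (row, col) for every cell whose centre falls inside the ring."""
    lons = [p[0] for p in ring]
    lats = [p[1] for p in ring]
    col0 = math.floor((min(lons) + 180) * CELLS_PER_DEG)
    col1 = math.floor((max(lons) + 180) * CELLS_PER_DEG)
    row0 = math.floor((min(lats) + 90) * CELLS_PER_DEG)
    row1 = math.floor((max(lats) + 90) * CELLS_PER_DEG)
    for row in range(row0, row1 + 1):
        lat_c = -90 + (row + 0.5) / CELLS_PER_DEG
        for col in range(col0, col1 + 1):
            lon_c = -180 + (col + 0.5) / CELLS_PER_DEG
            if point_in_ring(lon_c, lat_c, ring):
                yield row, col


def aggregate(items, cloud_free_below=10.0):
    """items: iterable of (ring, datetime, cloud_cover_pct) tuples."""
    acc = defaultdict(list)
    for ring, when, cloud in items:  # iterate items, stamp into cells
        for cell in cells_for_footprint(ring):
            acc[cell].append((when, cloud))
    stats = {}
    for cell, obs in acc.items():
        times = sorted(t for t, _ in obs)
        clouds = [c for _, c in obs]
        gaps = [(b - a).total_seconds() / 86400 for a, b in zip(times, times[1:])]
        stats[cell] = {
            "scenes_count": len(obs),
            "mean_scene_cloud_pct": sum(clouds) / len(clouds),
            "cloud_free_scene_count": sum(c < cloud_free_below for c in clouds),
            "max_gap_days": max(gaps) if gaps else float("nan"),  # undefined for 1 scene
        }
    return stats


# Two made-up scenes sharing one small square footprint.
square = [(0.01, 0.01), (0.19, 0.01), (0.19, 0.19), (0.01, 0.19), (0.01, 0.01)]
stats = aggregate([
    (square, datetime(2024, 1, 1), 5.0),
    (square, datetime(2024, 1, 11), 35.0),
])
```

A production version would more likely rasterise footprints with a geometry library and accumulate into arrays rather than Python lists; the dict-of-lists form is used here only to mirror the accumulator described in the steps.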

Either pass can be windowed by geopatcher to keep memory bounded: process the global grid in N-cell tiles, aggregate per-tile, concat.
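As a rough illustration of the tiling idea, the sketch below splits the 0.1° global grid (1800 × 3600 cells) into square index windows. It does not use geopatcher's API, which is not shown in this document; iter_tiles and the tile size are hypothetical.

```python
from typing import Iterator, Tuple

N_LAT, N_LON = 1800, 3600  # 180° / 0.1° rows, 360° / 0.1° columns


def iter_tiles(tile: int) -> Iterator[Tuple[slice, slice]]:
    """Yield (row_slice, col_slice) windows that cover the grid exactly once."""
    for r0 in range(0, N_LAT, tile):
        for c0 in range(0, N_LON, tile):
            yield (
                slice(r0, min(r0 + tile, N_LAT)),
                slice(c0, min(c0 + tile, N_LON)),
            )


tiles = list(iter_tiles(600))
covered = sum(
    (rows.stop - rows.start) * (cols.stop - cols.start) for rows, cols in tiles
)
```

Because the windows do not overlap, per-tile results can be concatenated without any merge step; items whose footprints straddle a tile edge simply contribute to each tile they touch.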

Architecture

This is where your repo stack lights up:

projects/satellite_climatology/
└── src/satellite_climatology/
    ├── grid.py             # (shared with v1)
    ├── sensors.py          # (shared with v1) + STAC collection mapping
    ├── catalog.py          # build_global_catalog(sensor, t0, t1) -> DuckDBGeoCatalog
    ├── aggregate.py        # bin items into cells -> per-cell stats
    └── operators.py        # geotoolz Operators wrapping the reducers

Repo wiring:

Output

Adds these four bands to the shared Zarr:

scenes_count           (sensor, time, lat, lon) int16
mean_scene_cloud_pct   (sensor, time, lat, lon) float32
cloud_free_scene_count (sensor, time, lat, lon) int16
max_gap_days           (sensor, time, lat, lon) float32

Same grid as v1, so:

import xarray as xr

ds = xr.open_zarr("data/satellite_climatology.zarr")
max_v1 = ds["overpasses"]                    # theoretical
got_v2 = ds["scenes_count"]                  # observed
catalog_gap = (max_v1 - got_v2).clip(min=0)  # the "reality gap" map

Compute budget

Pass 1 (catalog scan):

Pass 2 (aggregation):

Storage:

UI integration

Adds new stats to the dashboard dropdown:

Same tile-server approach as v1.

H3 alternative (equal-area)

Poleward of 60° latitude the 0.1° lat/lon cell is narrower than ~5.6 km (11.1 km × cos 60°) while staying ~11 km tall, a width-to-height distortion of 1:2 or worse. If equal-area matters (area-weighted statistics, comparing high vs low latitudes):

Decide this at M3 once we see the latitude-distortion in the v1 maps.
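The size of the distortion can be checked with a few lines of arithmetic. This sketch assumes a spherical Earth with roughly 111.32 km per degree of latitude, so cell width scales with the cosine of latitude; cell_size_km is a hypothetical helper.

```python
import math

KM_PER_DEG = 111.32  # approximate length of one degree of latitude


def cell_size_km(lat_deg: float, cell_deg: float = 0.1) -> tuple[float, float]:
    """Return (width_km, height_km) of a lat/lon cell centred at lat_deg."""
    height = KM_PER_DEG * cell_deg
    width = height * math.cos(math.radians(lat_deg))
    return width, height


width_60, height_60 = cell_size_km(60.0)
width_eq, _ = cell_size_km(0.0)
area_ratio = width_60 / width_eq  # cell area at 60° relative to the equator
```

At 60° a cell covers about half the area of an equatorial cell, so unweighted per-cell averages over-represent high latitudes by roughly that factor.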

Risks & open questions

Acceptance

Out of scope