v2 — Data-driven, scene-level cloud climatology
Of the scenes that were actually acquired and ingested into the Planetary Computer / Earth Search catalogs, how many cover each pixel and what was their scene-wide cloud cover?
Goal¶
For every pixel of the shared 0.1° global grid and each sensor, scan
the real STAC catalog and aggregate per-cell statistics from the
item metadata (footprint + eo:cloud_cover + datetime). No pixel
reads — this stage stays in metadata-land.
Question answered¶
“Given the catalog as it actually exists, how many scenes covered this pixel in , and what fraction were nominally cloud-free at the scene level?”
The user-facing pitch is the “reality gap” map: subtract v2’s
scenes_count from v1’s overpasses and you visualise where the
catalog is incomplete vs. theoretical max — useful for spotting
processing-pipeline outages, off-nadir tasking effects, or regions
where a sensor is simply not acquired (e.g., open ocean for L8/9
WRS-2 land grid).
Scope¶
- Sensors: same five as v1, but each maps to one or more STAC
collections (e.g.,
sentinel-2-l2afor sentinel-2). - Time window: configurable. Default: rolling 12 months. Longer is feasible (24 → 36 months) but linearly more compute.
- Resolution: 0.1° (shared). H3 res-5 alternative discussed below.
- STAC source: Microsoft Planetary Computer (anonymous, no auth) for all five sensors. Element84 Earth Search as a fallback.
Algorithm¶
Two-pass design — neither pass queries per cell:
Pass 1 — catalog ingestion¶
For each sensor’s collection:
- Issue a paginated STAC search over the global bbox + time window with no spatial filter (or a coarse one to chunk).
- Stream items into a DuckDB-backed GeoCatalog (
geocatalog’sDuckDBGeoCatalog— GeoParquet 1.1 with bbox-column predicate pushdown). Persists asdata/catalogs/<sensor>.parquet. - Item attributes kept:
id,datetime,geometry(footprint),eo:cloud_cover,platform.
Why DuckDB and not InMemory: 1 year of S2 over the globe is ~1M items. DuckDB handles this on a laptop; an InMemory GeoDataFrame doesn’t.
Pass 2 — per-cell aggregation¶
The key insight is inverting the loop. Don’t iterate cells and query the catalog. Iterate items and stamp them into cells:
- For each item, compute the set of grid cells its footprint intersects (rasterise the footprint polygon onto the 0.1° grid).
- For each cell, push the item’s
(datetime, eo:cloud_cover)into that cell’s accumulator. - After all items: each cell’s accumulator yields:
scenes_count=len(items)mean_scene_cloud_pct=mean(item.cloud_cover for item in items)cloud_free_scene_count=sum(c < 10 for c in cloud_covers)max_gap_days=max(diff(sorted(datetimes)))
Either pass can be windowed by geopatcher to keep memory bounded:
process the global grid in N-cell tiles, aggregate per-tile, concat.
Architecture¶
This is where your repo stack lights up:
projects/satellite_climatology/
└── src/satellite_climatology/
├── grid.py # (shared with v1)
├── sensors.py # (shared with v1) + STAC collection mapping
├── catalog.py # build_global_catalog(sensor, t0, t1) -> DuckDBGeoCatalog
├── aggregate.py # bin items into cells -> per-cell stats
└── operators.py # geotoolz Operators wrapping the reducersRepo wiring:
geocatalogis the substrate.from_stac_search(STACSource(...))ingests the items into aDuckDBGeoCatalogper sensor.- The catalog persists between runs (it’s GeoParquet on disk), so re-aggregating with different thresholds is free.
geopatcherdrives the grid iteration.SpatialPatcher(SpatialRectangular(size=(0.1°, 0.1°)), SpatialRegularStride(...))over aRasterFieldcovering the global bbox.- Per-tile (not per-cell) work is run via
runners.parallel_mapacross processes.
geotoolzholds the reducer as a composable graph:per_cell = ( Fanout({ "scenes_count": Count(), "mean_scene_cloud_pct": Mean(field="cloud_cover"), "cloud_free_scene_count": CountWhere(field="cloud_cover", op="<", value=10), "max_gap_days": MaxGap(field="datetime"), }) )The
Fanoutshape means one item-iteration computes all four stats;get_config()serialises the reducer for the run manifest.georeaderis not used in v2. (That’s v2.5.)
Output¶
Adds these four bands to the shared Zarr:
scenes_count (sensor, time, lat, lon) int16
mean_scene_cloud_pct (sensor, time, lat, lon) float32
cloud_free_scene_count (sensor, time, lat, lon) int16
max_gap_days (sensor, time, lat, lon) float32Same grid as v1, so:
ds = xr.open_zarr("data/satellite_climatology.zarr")
gap_v1 = ds["overpasses"] # theoretical
got_v2 = ds["scenes_count"] # observed
catalog_gap = (gap_v1 - got_v2).clip(min=0) # the "reality gap" mapCompute budget¶
Pass 1 (catalog scan):
- S2 1 yr globally: ~1M items, MPC paginated search at ~500 items/page → ~2000 paginated requests, ~30 min wall-clock (mostly network).
- × 5 sensors with very different volumes: MODIS daily × 8 platforms is the heaviest at ~10M items/yr.
- → ~2–6 hours total. Run once, persist.
Pass 2 (aggregation):
- 1M items × ~100 cells/item = ~100M (cell, item) stamps.
- Vectorised by tile, this is ~5–15 min per sensor on a laptop.
- Parallelised over patches: linearly faster.
Storage:
- Per-sensor GeoParquet catalogs: a few GB total.
- Output Zarr (with v1+v2 bands, 5 sensors, 12 months): ~2–3 GB.
UI integration¶
Adds new stats to the dashboard dropdown:
scenes_count— observed catalog densitycloud_free_scene_count— observed cloud-free densitymean_scene_cloud_pct— observed mean cloudmax_gap_days— longest dry spellreality_gap= v1.overpasses − v2.scenes_count — derived band
Same tile-server approach as v1.
H3 alternative (equal-area)¶
For latitudes > 60° the 0.1° lat/lon cell shrinks to ~5 km width while staying 11 km tall — heavy distortion. If equal-area matters (area-weighted statistics, comparing high vs low latitudes):
- Replace the grid with H3 hex bins at resolution 5 (~250 km² each, ~2.0M cells globally).
- All
geopatcher-iteration logic is the same; replaceSpatialRectangularwith aPolygonIntersectiongeometry over the H3 polygon set. - Zarr → Parquet (
pandaswith H3 index column), or sparse Zarr.
Decide this at M3 once we see the latitude-distortion in the v1 maps.
Risks & open questions¶
- Catalog completeness varies by sensor. MPC’s Landsat goes back to 2013, MODIS to 2000+, S2 to 2015. The product window must be clipped per-sensor or padded with nulls.
eo:cloud_coveris scene-wide. A 60%-cloud scene can be 100% clear over your pixel (or vice versa). v2 is documented as a scene-level statistic; v2.5 is the per-pixel answer.- Footprints aren’t pixel-exact — STAC item geometry is the scene outline, which can be slightly larger than actual valid-data extent (especially S2 swath edges). Acceptable for cell-scale stats.
- Re-ingestion drift — collections get reprocessed (new baseline, republished items). Pin the run date and store it in the Zarr metadata so users know what catalog snapshot they’re looking at.
- Rate limits — MPC is generous but not infinite. Throttle the
scan with
pystac-client’smax_itemsper call + a backoff decorator.
Acceptance¶
- All four data vars are written for at least one sensor over a 12-month window at 0.1°.
- Sanity checks:
scenes_count≤ v1overpassescell-wise (catalog ≤ theoretical).mean_scene_cloud_pctmap qualitatively matches a published cloud climatology (e.g., MODIS MOD08 monthly mean) — wet-season AmazonasAtacama Desert.
cloud_free_scene_countis zero for cells withscenes_count = 0.
- Catalog Parquet files are reproducible (re-running the pipeline yields the same row count given the same date pin).
- Reality-gap map renders in the dashboard.
Out of scope¶
- Per-pixel cloud masks (v2.5).
- L1C / L2A choice — we use L2A everywhere it exists; document the
collection per sensor in
sensors.py. - Off-nadir tasking modelling.
- Sub-monthly time bins. (Easy to add later — the schema supports it, just blow up the time axis.)