Benchmarks Gallery

UNEP
IMEO
MARS
Status: Companion gallery to benchmark.md
Reading time: ~35 min
Audience: Anyone wanting concrete benchmark designs across Ocean, Land, Atmosphere, Remote Sensing, and Mathematical Models domains, instantiated against the GeoStack framework
Companion: Report 15 (framework), Report 14 (pipekit-evaluate), geodata_lifecycle.md (data lifecycle), geostack_vision.md

What this document is

Worked examples. The framework in Report 15 specifies what makes a benchmark benchmarkable; this gallery instantiates the framework across five domains:

  1. Ocean — SSH, SST, SSS, Ocean Colour, BGC

  2. Land — Temperature, Precipitation, Wind, Surface Pressure

  3. Atmosphere — Gases, 3D Wind, Pressure

  4. Remote Sensing — Multispectral-Hyperspectral, Polar-Geo, RTM, Sensor Ops, Multi-Satellite Fusion

  5. Mathematical Models — Emission Estimation (a multi-stage inverse problem)

Each entry follows a consistent template so they’re scannable and comparable:

### Task name

**Carrier transformation.** Which row of the task taxonomy
**Variant.** Deterministic / probabilistic / both
**Why this matters.** Scientific or operational significance
**Reference data.** What's used as truth, by track
**Tracks.** Model-to-reanalysis / -analysis / -observations as applicable
**Baselines.** Mandatory shared baselines
**Metric set.** Across Lens × Unit matrix cells
**Splits.** Block discipline, leakage rules
**Known failure modes.** What models typically get wrong
**Stack mapping.** Which GeoStack pieces implement which part

Read each entry as a draft benchmark contract — not yet pre-registered, but specified concretely enough that it could be.
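As a rough illustration of what "content-addressed" could mean for such a contract, the sketch below serialises a few template fields to canonical JSON and hashes them. The `BenchmarkContract` class, its field names, and the hashing scheme are all assumptions made for this example; they are not the Report 15 API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class BenchmarkContract:
    """Hypothetical container mirroring part of the entry template."""
    task_name: str
    carrier_transformation: str
    variant: str
    tracks: tuple = ()
    splits: str = ""

    def content_hash(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) so that two
        # contracts with identical content always hash identically.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


ssh = BenchmarkContract(
    task_name="Sea Surface Height (SSH)",
    carrier_transformation="Gap-filling: Obs (along-track + swaths) -> Obs (dense grid)",
    variant="both",
    tracks=("model-to-reanalysis", "model-to-analysis", "model-to-observations"),
    splits="SpatioTemporalBlockSplit(spatial_block_km=200, temporal_block_days=60)",
)

# An identical copy hashes the same; any change to the contract changes the hash.
ssh_copy = BenchmarkContract(**asdict(ssh))
ssh_changed = replace(ssh, splits="SpatioTemporalBlockSplit(spatial_block_km=100, temporal_block_days=60)")
```

Under this scheme, pre-registering a benchmark would amount to publishing the hash before any results exist; whether the real framework hashes the same fields is not specified here.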


Domain 1 — Ocean

The most operationally mature domain for ML benchmarking. Strong existing community infrastructure (OceanBench), good reference products (GLORYS, DUACS, OSTIA, ISAS, CCI), but persistent gaps in 3D subsurface, coastal/shelf regimes, and biogeochemistry that the source’s gap analysis identified. Each benchmark below names which gap it addresses.

1.1 — Sea Surface Height (SSH)

Carrier transformation. Gap-filling: Obs (along-track points + SWOT swaths) → Obs (dense grid). Or, for forecast variants, State → State (future).

Variant. Both deterministic (DUACS-style L4 product) and probabilistic (ensemble reconstructions are increasingly common; SWOT calibration introduces honest spread).

Why this matters. SSH drives the surface geostrophic flow; mesoscale eddies dominate ocean variability; SWOT’s wide-swath altimetry has changed the data-sparsity assumptions that decades of OI-based reconstruction were built on. This is the canonical ML-ocean benchmark: a method that cannot do this is unlikely to cope with the harder ocean tasks.

Reference data.

Tracks. All three available. The observations track is the strongest test because the L4 products were built from the same observations — model-to-L4 measures “did you learn the OI”; model-to-obs measures “did you learn the underlying SSH field.”

Baselines.

Metric set.

Splits. Mesoscale decorrelation ~100 km / ~30 days → SpatioTemporalBlockSplit(spatial_block_km=200, temporal_block_days=60). Test set is held-out year. Critical: SWOT and traditional altimetry must be split by mission, not pooled, because they have different sampling characteristics.
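The idea behind a spatio-temporal block split can be sketched as follows: every sample is assigned to a space-time block, and whole blocks (never individual samples) are assigned to folds. In the SSH example the block sizes are twice the quoted decorrelation scales. The functions below are hypothetical helpers, not the `SpatioTemporalBlockSplit` implementation, and they assume coordinates already projected to kilometres and time expressed in days.

```python
import hashlib
import math


def block_id(x_km, y_km, t_days, spatial_block_km, temporal_block_days):
    """Index of the space-time block containing one sample."""
    return (
        math.floor(x_km / spatial_block_km),
        math.floor(y_km / spatial_block_km),
        math.floor(t_days / temporal_block_days),
    )


def fold_of(block, n_folds=5):
    """Deterministically map a block index to a fold, so whole blocks move together."""
    digest = hashlib.sha256(repr(block).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_folds


# Two samples inside the same 200 km / 60 day block, and one in a neighbouring block.
a = block_id(10.0, 20.0, 5.0, spatial_block_km=200.0, temporal_block_days=60.0)
b = block_id(150.0, 190.0, 59.0, spatial_block_km=200.0, temporal_block_days=60.0)
c = block_id(210.0, 20.0, 5.0, spatial_block_km=200.0, temporal_block_days=60.0)
```

Samples `a` and `b` share a block and therefore always land in the same fold, which is what prevents correlated neighbours from straddling the train/test boundary.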

Known failure modes.

Stack mapping.

Gap addressed. The classical ocean benchmark; OceanBench SSH Edition is the canonical instance. Our contribution would be a content-addressed contract version of OceanBench-SSH that’s pre-registrable.


1.2 — Sea Surface Temperature (SST)

Carrier transformation. Gap-filling: Obs (cloud-affected sparse grid + in-situ points) → Obs (dense, gap-free grid).

Variant. Primarily deterministic; ensemble L4 products exist but are less standard than for SSH.

Why this matters. SST is the most-observed ocean variable, used everywhere from weather forecasting to ENSO monitoring. Cloud-affected gaps and diurnal variability are the hard problems; the field is “easy” compared to SSH but high-stakes because everyone uses the L4 products downstream.

Reference data.

Tracks. Model-to-L4 (OSTIA / CCI), model-to-observations (drifters, Argo), no clean model-to-analysis distinction (L4 is the analysis).

Baselines.

Metric set.

Splits. Mesoscale + sub-mesoscale → SpatioTemporalBlockSplit(spatial_block_km=100, temporal_block_days=30). Diurnal-cycle splits: time-of-day must be balanced across train/val/test to avoid the model learning “test set is mostly 06Z.”

Known failure modes.

Stack mapping.

Gap addressed. Standard L4 fusion benchmark. The marine-heatwave event detection track is underdeveloped in current benchmarks; this is where the framework’s event-unit + detection-lens combination adds value.


1.3 — Sea Surface Salinity (SSS)

Carrier transformation. Gap-filling + retrieval: very sparse satellite (SMOS / SMAP / Aquarius) + Argo points → dense grid.

Variant. Primarily deterministic, but uncertainty quantification matters more than for SST because of low signal-to-noise.

Why this matters. SSS drives the haline component of ocean circulation; river plumes, ice melt, and precipitation patterns drive variability; satellite SSS is very low S/N (calibration errors comparable to natural variability in some regions). The hard data-scarcity benchmark in ocean ML.

Reference data.

Tracks. Model-to-CCI (L4), model-to-ISAS (analysis), model-to-Argo (observations). The Argo track is critical because the L4 products are heavily smoothed.

Baselines.

Metric set.

Splits. Long correlation scales → SpatioTemporalBlockSplit(spatial_block_km=300, temporal_block_days=90). By river system: leave-one-major-river-out (Amazon, Mississippi, Ganges-Brahmaputra) tests whether the model learned regional patterns or transferable physics.

Known failure modes.

Stack mapping.

Gap addressed. Underrepresented in current ocean ML benchmarks. The combination of low S/N + sparse in-situ + strong regional patterns is a good test of uncertainty-aware methods.


1.4 — Ocean Colour (OC)

Carrier transformation. Retrieval + gap-filling: L1 radiance → L2 Chl-a / Kd490 → L3 gridded product.

Variant. Deterministic for the canonical Chl-a product; probabilistic variants are research-stage.

Why this matters. Phytoplankton biomass is the base of the marine food web; Chl-a is the most widely-used satellite-derived biogeochemical variable. The retrieval is non-trivial (atmospheric correction is the dominant error source), and “Case 2” coastal waters (with CDOM, sediment, bottom reflectance) break the standard algorithms.

Reference data.

Tracks. Model-to-OC-CCI (multi-mission L4), model-to-HPLC (in-situ ground truth). Model-to-operational track tests whether you beat the operational algorithm.

Baselines.

Metric set.

Splits. Decorrelation varies hugely by region. Open ocean: ~100 km / ~30 days. Coastal: ~10 km / ~3 days. Split by biogeochemical province (Longhurst provinces): leave-one-province-out tests true regional generalization.

Known failure modes.

Stack mapping.

Gap addressed. OC benchmarks are domain-mature but rarely use the full lens × unit matrix — bloom detection (event) and log-distribution comparison (statistic) are typically reported only in research papers, not standardised.


1.5 — Biogeochemistry (BGC)

Carrier transformation. Discretization + Gap-filling: BGC-Argo points (DO, pH, NO3, Chl, irradiance) → 3D gridded fields.

Variant. Probabilistic strongly preferred — calibration drift in BGC-Argo sensors makes uncertainty quantification non-optional.

Why this matters. The largest gap in current ocean ML benchmarks (per the source). BGC-Argo has only recently grown enough to support gridded products; 3D subsurface biogeochemistry is what the next generation of ocean ML needs to tackle.

Reference data.

Tracks. Model-to-WOA (climatological), model-to-BGC-Argo (in-situ, held-out floats), model-to-GLODAP (bottle data). The Argo track with leave-one-float-out is the strongest test.

Baselines.

Metric set.

Splits. Leave-one-float-out is the canonical test: train on most Argo floats, test on the full record of each held-out float. Add a temporal block, because some BGC signals drift on decadal timescales, and a regional split (Atlantic / Pacific / Indian / Southern Ocean).
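A leave-one-float-out split can be sketched in a few lines: group sample indices by platform, then hold out one platform's entire record per fold. The function name and the toy float IDs below are illustrative assumptions, not part of any GeoStack package.

```python
from collections import defaultdict


def leave_one_platform_out(platform_ids):
    """Yield (held_out, train_idx, test_idx), holding out one platform per fold."""
    by_platform = defaultdict(list)
    for i, pid in enumerate(platform_ids):
        by_platform[pid].append(i)
    for held_out in sorted(by_platform):
        test_idx = by_platform[held_out]
        train_idx = sorted(
            i for pid, idx in by_platform.items() if pid != held_out for i in idx
        )
        yield held_out, train_idx, test_idx


# Toy example: seven profiles from three hypothetical floats.
float_ids = ["F01", "F01", "F02", "F03", "F02", "F03", "F03"]
folds = list(leave_one_platform_out(float_ids))
```

Because the held-out float never contributes a profile to training, sensor-specific calibration drift cannot leak across the split.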

Known failure modes.

Stack mapping.

Gap addressed. The headline missing benchmark in current ocean ML. The carrier-transformation framing makes the structure clean: it’s a Discretization + Gap-Filling chain applied to multi-variable 3D fields with the depth axis as primary, exactly the case the existing GeoStack underspecifies.


Domain 2 — Land

Land-surface benchmarks have a long meteorological tradition (ERA5-Land, GHCN, IMERG) but ML benchmarks lag weather ones by several years. The opportunity is to bring the same multi-track / multi-lens discipline that weather and ocean benchmarks have developed.

2.1 — 2m Temperature (T2m) / Land Surface Temperature (LST)

Carrier transformation. Gap-filling + Forecast: stations + satellite → dense grid; or, for forecast variants, State → State (future).

Variant. Both. Operational forecasts increasingly probabilistic (ECMWF ENS).

Why this matters. Heatwave prediction has direct mortality consequences; LST is the most widely used remotely sensed land variable; T2m is the headline weather variable for most public-facing products.

Reference data.

Tracks. Model-to-reanalysis (ERA5-Land), model-to-station (GHCN), model-to-LST. All three are valuable; LST track is the hardest because of strong gradients and clear-sky bias.

Baselines.

Metric set.

Splits. SpatioTemporalBlockSplit(spatial_block_km=200, temporal_block_days=30). By climate zone (Köppen-Geiger classes): leave-one-zone-out tests whether the model learned transferable physics. By elevation band for orographic regions.

Known failure modes.

Stack mapping.

Gap addressed. Land surface ML benchmarking is dominated by point-wise RMSE; heatwave detection and physical-constraint lenses would be a meaningful upgrade.


2.2 — Precipitation

Carrier transformation. Forecast + State estimation: gauge + satellite (GPM / IMERG) + radar → dense grid; or, for forecast, State → State (future).

Variant. Strongly probabilistic. Precipitation is the canonical heavy-tailed, intermittent, double-penalty-cursed variable. Deterministic point forecasts are nearly meaningless for actionable use.

Why this matters. Operational weather forecasting’s worst-performing variable; flood prediction’s first ingredient; the canonical “double penalty” failure case. Whatever you think your ML model does, precipitation benchmarks will reveal what it actually learned.

Reference data.

Tracks. All three; the gauge track is the most-trusted truth where coverage is dense (US, Europe, parts of Asia).

Baselines.

Metric set.

Splits. Causal temporal split is mandatory (forecasting); SpatioTemporalBlockSplit(spatial_block_km=500, temporal_block_days=14). Note the large spatial block — synoptic systems are large. By climate zone + by season to expose seasonal biases.

Known failure modes.

Stack mapping.

Gap addressed. Existing precipitation ML benchmarks (IMS, MetNet evaluations) emphasize point-wise + threshold detection but underdevelop the probabilistic + spectral combination. Both should be standard.
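For the probabilistic side, one widely used score is the continuous ranked probability score (CRPS). A minimal sketch of the standard ensemble estimator, mean absolute error of the members minus half the mean absolute pairwise member difference, is shown below. The function name is an assumption, and this is not necessarily the estimator pipekit-evaluate would ship.

```python
def crps_ensemble(members, obs):
    """CRPS estimate for one observation from a finite ensemble.

    CRPS = mean_i |x_i - y| - 0.5 * mean_{i,j} |x_i - x_j|
    """
    m = len(members)
    accuracy = sum(abs(x - obs) for x in members) / m
    spread = sum(abs(xi - xj) for xi in members for xj in members) / (2 * m * m)
    return accuracy - spread


# Observed rain rate of 1.0 (arbitrary units).
point_forecast = crps_ensemble([2.0], 1.0)        # single member: plain absolute error
honest_ensemble = crps_ensemble([0.0, 2.0], 1.0)  # spread that brackets the observation
```

With a single member the score reduces to the absolute error, so deterministic and ensemble forecasts can be compared on the same scale.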


2.3 — Wind Speed and Direction

Carrier transformation. Forecast + Gap-filling: stations + scatterometer (over ocean) + ECMWF → dense grid.

Variant. Both deterministic and probabilistic; ensemble wind forecasts critical for renewable-energy applications.

Why this matters. Vector-valued variable with anisotropy and circular statistics; energy applications (wind power) drive the operational stakes; gusts are a high-impact tail behavior.

Reference data.

Tracks. Model-to-ERA5, model-to-station (lots of regional coverage), model-to-ASCAT (ocean), model-to-radiosonde (upper-air).

Baselines.

Metric set.

Splits. SpatioTemporalBlockSplit(spatial_block_km=200, temporal_block_days=10). By topographic complexity: leave-one-region-out among flat/coastal/mountainous classes.

Known failure modes.

Stack mapping.

Gap addressed. Vector-variable benchmarks rarely report proper circular statistics; the framework’s metric-as-operator design lets CircularRMSE ship as a standard implementation.
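To illustrate why direction needs circular statistics, the sketch below wraps each direction error into [-180, 180) degrees before squaring. It is a minimal stand-in for what a `CircularRMSE` operator might compute; the function names are assumptions.

```python
import math


def angular_difference_deg(pred, obs):
    """Signed smallest difference pred - obs, wrapped into [-180, 180)."""
    return (pred - obs + 180.0) % 360.0 - 180.0


def circular_rmse_deg(pred_dirs, obs_dirs):
    """RMSE of wind direction using wrapped angular differences."""
    diffs = [angular_difference_deg(p, o) for p, o in zip(pred_dirs, obs_dirs)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def naive_rmse(pred_dirs, obs_dirs):
    """Plain RMSE that ignores the 0/360 wrap-around (shown for contrast)."""
    diffs = [p - o for p, o in zip(pred_dirs, obs_dirs)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


pred = [350.0, 10.0, 180.0]
obs = [10.0, 350.0, 170.0]
circ = circular_rmse_deg(pred, obs)
naive = naive_rmse(pred, obs)
```

Here the first two forecasts are only 20 degrees off, but a naive RMSE treats them as 340 degrees off and inflates the score by more than an order of magnitude.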


2.4 — Surface Pressure (MSLP)

Carrier transformation. Forecast + Gap-filling: stations + GPS-RO + reanalysis → dense grid.

Variant. Both; ensemble pressure forecasts inherit from ENS / GEFS.

Why this matters. Pressure tracks synoptic systems (cyclones, blocking); the headline operational forecast variable for medium-range; MSLP errors aggregate into storm-track errors.

Reference data.

Tracks. Model-to-reanalysis, model-to-analysis (operational analysis), model-to-station.

Baselines.

Metric set.

Splits. Synoptic timescales: SpatioTemporalBlockSplit(spatial_block_km=1000, temporal_block_days=14). By ENSO phase for inter-annual variability; by season. Take care around major eruptions (Pinatubo 1991, Hunga Tonga 2022): split so that training and test sets do not share the same post-eruption recovery period.

Known failure modes.

Stack mapping.

Gap addressed. Cyclone tracking as an event-detection benchmark exists in research literature (TempestExtremes, TRACK algorithm comparisons) but isn’t standardized for ML; the framework’s event-unit + detection-lens makes it shippable.


Domain 3 — Atmosphere

The most-mature ML benchmarking domain (WeatherBench, GraphCast / FourCastNet / Pangu / GenCast evaluations). Our contribution is trace gas / chemistry benchmarking, which is structurally similar to weather but operationally less developed.

3.1 — Trace Gases (Methane, CO2, Water Vapor)

Carrier transformation. Retrieval + Gap-filling + Inverse: hyperspectral L1 → L2 column → L3 grid → L4 source estimates.

Variant. Increasingly probabilistic; uncertainty quantification mandatory for emissions attribution.

Why this matters. The headline use case for the MARS / IMEO mission. Multi-stage benchmark: retrieval accuracy at L2, gridding skill at L3, source attribution at L4. Each stage is a separate benchmarkable transformation; the chain is also benchmarkable end-to-end.

Reference data.

Tracks. Model-to-CAMS (reanalysis), model-to-TCCON (column ground truth), model-to-flask (point ground truth), model-to-aircraft (mid-altitude), model-to-controlled-release (the strongest test for attribution).

Baselines.

Metric set.

Splits. Leave-one-source-class-out (urban / oil-and-gas / agricultural / wetland); leave-one-region-out; temporal block. Critical: METEC controlled-release data must be temporally held out because the same emitters are observed many times.

Known failure modes.

Stack mapping.

Gap addressed. The multi-stage / multi-track structure isn’t standardized; each MARS / IMEO study uses bespoke evaluation. Standardizing this as a benchmark contract would be a real community contribution.


3.2 — 3D Wind

Carrier transformation. Forecast + Gap-filling: sondes + scatterometer + AMV + reanalysis → dense 3D grid.

Variant. Both; operational forecasts are deterministic, research increasingly probabilistic.

Why this matters. 3D winds drive transport (chemistry, dust, aerosols); upper-level winds drive jet-stream variability that controls extreme weather; the bottleneck for atmospheric chemistry forecasts.

Reference data.

Tracks. Model-to-ERA5, model-to-sonde, model-to-AMV, model-to-Aeolus.

Baselines.

Metric set.

Splits. SpatioTemporalBlockSplit(spatial_block_km=500, temporal_block_days=14). By altitude band (boundary layer / free troposphere / stratosphere) for stratified evaluation.

Known failure modes.

Stack mapping.

Gap addressed. 3D evaluation is reported per-pressure-level in research papers but rarely as a standardised benchmark; the framework’s depth-axis support (from geodata_lifecycle.md) makes this clean.


3.3 — Atmospheric Pressure (Z500, MSLP, Tropopause)

Carrier transformation. Forecast: 3D state → 3D state (future).

Variant. Both; ENS / GEFS provide ensemble references.

Why this matters. Z500 is the canonical NWP forecast metric (“ACC at Z500” is the headline number for medium-range forecasts); MSLP tracks synoptic systems; tropopause height connects tropospheric and stratospheric dynamics.
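The anomaly correlation coefficient can be sketched as the (optionally weighted) correlation between forecast and verification anomalies relative to a climatology. The version below is the uncentered form; benchmark suites differ on centering and on latitude weighting, so treat the function as an illustrative assumption rather than the WeatherBench definition.

```python
import math


def anomaly_correlation(forecast, verification, climatology, weights=None):
    """Uncentered ACC between forecast and verification anomalies."""
    if weights is None:
        weights = [1.0] * len(forecast)
    fa = [f - c for f, c in zip(forecast, climatology)]
    va = [v - c for v, c in zip(verification, climatology)]
    num = sum(w * a * b for w, a, b in zip(weights, fa, va))
    den = math.sqrt(
        sum(w * a * a for w, a in zip(weights, fa))
        * sum(w * b * b for w, b in zip(weights, va))
    )
    return num / den


# Toy Z500 values (metres) at three grid points.
clim = [5500.0, 5600.0, 5700.0]
verif = [5520.0, 5590.0, 5730.0]
perfect = anomaly_correlation(verif, verif, clim)
mirrored = anomaly_correlation([5480.0, 5610.0, 5670.0], verif, clim)
```

A forecast identical to the verification scores 1, and one whose anomalies have the opposite sign everywhere scores -1.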

Reference data.

Tracks. Model-to-reanalysis, model-to-analysis (higher-resolution operational analysis), model-to-radiosonde.

Baselines.

Metric set.

Splits. Synoptic timescales: SpatioTemporalBlockSplit(spatial_block_km=1000, temporal_block_days=14). By ENSO phase, by NAO phase for inter-annual variability.

Known failure modes.

Stack mapping.

Gap addressed. Already well-instrumented (WeatherBench 2 is the standard). The framework’s contribution here is content-addressable contracts — WeatherBench 2 is shared via documentation; making it a hashable artifact would close the verification gap.


Domain 4 — Remote Sensing

Cross-instrument benchmarks. Less mature as standardised ML benchmarks than weather / ocean, but operationally critical for any multi-sensor product.

4.1 — Multispectral ↔ Hyperspectral

Carrier transformation. Cross-instrument harmonization, super-resolution: hyperspectral (high spectral, low spatial / temporal) ↔ multispectral (lower spectral, higher spatial / temporal).

Variant. Deterministic primarily; uncertainty propagation in research stage.

Why this matters. Hyperspectral data is information-rich but coverage-poor; multispectral is the operational workhorse. Cross-instrument fusion is what unlocks both worlds; the unsolved hyperspectral super-resolution problem has direct applications in agriculture, water quality, and atmospheric chemistry.

Reference data.

Tracks. Model-to-coincident-hyperspectral (where simultaneous overflights exist), model-to-airborne (gold standard), model-to-downstream-task (does the predicted hyperspectral improve a known task?).

Baselines.

Metric set.

Splits. Leave-one-scene-out: each PRISMA / EnMAP scene is a unit. Temporal block (some surfaces evolve seasonally). By land-cover class for transferability.

Known failure modes.

Stack mapping.

Gap addressed. No standard cross-instrument benchmark exists in the community despite the operational need. The carrier-aware framing (geocatalog indexes both Sentinel-2 and PRISMA; patcher co-registers) makes this practical to ship.


4.2 — Polar-Orbiting ↔ Geostationary

Carrier transformation. Cross-platform fusion, temporal super-resolution: polar (high spatial, low temporal) + geostationary (low spatial, high temporal) → both high.

Variant. Deterministic; probabilistic variants emerging for nowcasting.

Why this matters. Polar orbiters give global high-spatial coverage at low cadence; geostationary gives high-cadence sub-hemisphere coverage at lower spatial resolution. Operational meteorology depends on the fusion; nowcasting (0-2h forecasts) is increasingly driven by this combination.

Reference data.

Tracks. Model-to-coincident-overpass (where polar and geo see the same scene), model-to-in-situ (where ground stations validate both).

Baselines.

Metric set.

Splits. By season (geometric viewing-angle effects); by latitude band (polar coverage gets denser with latitude; geo coverage drops); by sensor pair.

Known failure modes.

Stack mapping.

Gap addressed. Operational meteorology centers do this internally but rarely as a public benchmark. Standardizing it would help the NWP and nowcasting communities.


4.3 — Radiative Transfer Model (RTM) Emulation

Carrier transformation. Forward simulation / emulation: atmospheric + surface state → simulated radiances.

Variant. Deterministic; with Jacobian (gradient) requirements for inverse problems.

Why this matters. RT calculations are the inner loop of nearly every retrieval, and they are expensive (LBLRTM takes minutes per scene). Neural emulators that offer a 100× speedup while maintaining accuracy and differentiability unlock differentiable retrievals (per Report 5 on pipekit-jax).

Reference data.

Tracks. Model-to-LBLRTM (accuracy), model-to-MODTRAN (operational), model-to-actual-observations (the strongest test: does the emulator reproduce RT closely enough that the retrieval still works?).

Baselines.

Metric set.

Splits. By atmospheric state regime (clear / aerosol / cloudy); by surface type (ocean / vegetation / desert / snow); by gas concentration range. Train on standard distributions, test on extremes (high methane, high aerosol load).

Known failure modes.

Stack mapping.

Gap addressed. RT emulator benchmarks exist in research papers (RTTOV-NN, RTNN) but rarely with the Jacobian-fidelity + out-of-distribution evaluation that operational retrievals need.
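Why Jacobian fidelity deserves its own metric can be seen in a toy case. Below, a Beer-Lambert transmittance stands in for the reference RT code and a second-order Taylor expansion stands in for an emulator; both functions are invented for illustration and have nothing to do with LBLRTM or any real emulator. Jacobians are taken by central finite differences.

```python
import math


def reference_rt(column, k=1.0):
    """Toy stand-in for a reference RT code: Beer-Lambert transmittance."""
    return math.exp(-k * column)


def emulator_rt(column, k=1.0):
    """Toy stand-in for an emulator: second-order Taylor expansion around zero."""
    x = k * column
    return 1.0 - x + 0.5 * x * x


def jacobian_fd(f, column, h=1e-5):
    """Central finite-difference derivative of f with respect to the gas column."""
    return (f(column + h) - f(column - h)) / (2.0 * h)


column = 0.5
value_error = abs(emulator_rt(column) - reference_rt(column))
jacobian_error = abs(jacobian_fd(emulator_rt, column) - jacobian_fd(reference_rt, column))
```

In this toy case the emulated radiance is within about 0.02 of the reference while its derivative is off by about 0.1, which is the kind of gap a value-only metric would hide from a gradient-based retrieval.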


4.4 — Satellite Sensor Operators (SRF, PSF, Noise)

Carrier transformation. Instrument simulation / inverse instrument simulation: high-resolution truth → instrument-degraded observation; or inverse.

Variant. Deterministic for forward instrument simulation; probabilistic for inversion.

Why this matters. The instrument is the boundary between the physical world and the data. Errors here cascade through every downstream product. Sensor-operator emulators that correctly model SRF (spectral response function), PSF (point spread function), and noise statistics are what enable OSSEs (Observing System Simulation Experiments) and counterfactual analyses (“what if we had this sensor?”).

Reference data.

Tracks. Simulator-to-actual-instrument (degrade truth, compare to actual L1); model-to-pre-launch-spec (validate against vendor characterization).

Baselines.

Metric set.

Splits. By spectral band; by detector array element (different detectors have different responses); by mission phase (pre-launch / commissioning / nominal / end-of-life).

Known failure modes.

Stack mapping.

Gap addressed. Sensor operator benchmarks are typically internal to space agencies; making them public benchmarks would benefit the OSSE community.


4.5 — Multi-Satellite Fusion (NEW)

Carrier transformation. Cross-sensor fusion: multiple sensor streams (e.g., MODIS + geostationary; Sentinel-2 + Sentinel-3; PRISMA + Sentinel-2) → unified product.

Variant. Both deterministic and probabilistic; ensemble fusion is increasingly used for uncertainty propagation.

Why this matters. No single sensor gives everything you want; multi-sensor fusion is what gives operational products their robustness. A canonical benchmark for “modern remote sensing” as it’s actually practiced. Most operational L4 products (OSTIA, OC-CCI, IMERG, ECV products) are multi-sensor fusions internally.

Reference data.

Tracks. Model-to-operational-fusion (does your fusion beat the operational product?); model-to-single-sensor (where overlap exists, does multi-sensor add value?); model-to-in-situ (the strongest test).

Baselines.

Metric set.

Splits. Leave-one-sensor-out: train the fusion on all N sensors, then test with one sensor withheld (does the model still produce a sensible product when one sensor is missing?). Temporal block. By sensor-combination availability (some years have more sensors than others).

Known failure modes.

Stack mapping.

Gap addressed. Multi-sensor fusion is how operational products work but isn’t standardised as an ML benchmark. The MODIS + geostationary example is one of the most operationally important: high-cadence diurnal-cycle resolution combined with high-spatial-resolution snapshots. The framework’s catalog-and-matchup machinery is purpose-built for this case.


Domain 5 — Mathematical Models

The benchmarks here are different in shape: they test a chain of dependent sub-tasks, not a single transformation. Methane emission estimation is the canonical example — five linked steps from radiative transfer to total emission.

5.1 — Emission Estimation (Multi-Stage Inverse Problem)

This is a meta-benchmark composed of five sub-benchmarks. Each sub-stage is independently benchmarkable; the full chain is also benchmarkable end-to-end.

   Stage 1: Radiative Transfer Model (RTM)
   ──────────────────────────────────────
   Carrier: atmospheric+surface state → radiances
   (See section 4.3 above; RT emulation is the underlying capability)
   
            ▼
   Stage 2: Plume Simulation (Forward Model Emulation)
   ───────────────────────────────────────────────────
   Carrier: source emission rate + meteorology → downwind concentration field
   References: HYSPLIT, FLEXPART, WRF-Chem; controlled-release ground truth
   
            ▼
   Stage 3: Probability of Detection (POD Curve)
   ─────────────────────────────────────────────
   Carrier: scene + emission characteristics → probability of being detected
   References: METEC controlled-release campaigns (known emission rates,
              measured detection rates per overpass)
   
            ▼
   Stage 4: Source Persistency
   ────────────────────────────
   Carrier: source-history time-series → persistence probability
   References: long-term monitoring of known persistent emitters
              (oil and gas facilities, landfills, agricultural sources)
   
            ▼
   Stage 5: Total Emission
   ─────────────────────────
   Carrier: detections + persistencies + durations → mass flux (kt/year)
   References: bottom-up inventories (EDGAR, EPA GHGRP), tall-tower
              regional flux estimates, controlled-release totals

Why this matters. The headline operational case for MARS / IMEO. Each stage has its own ML opportunity; the chain has compound uncertainty that matters operationally. The benchmark structure forces evaluation at each stage and end-to-end, which is what regulatory applications require.
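The Stage 5 arithmetic can be sketched under a deliberately simple assumption: each source emits at a constant rate for a fraction of the year given by its persistence, and sources add linearly. The dictionary keys, the constant-rate model, and the example numbers are all illustrative assumptions; a real aggregation would also need to account for detection probability and duration uncertainty.

```python
HOURS_PER_YEAR = 8760.0  # non-leap year
KG_PER_KT = 1.0e6


def annual_emission_kt(sources):
    """Sum rate x persistence x hours over sources and convert kg/year to kt/year."""
    total_kg = 0.0
    for src in sources:
        total_kg += src["rate_kg_per_h"] * src["persistence"] * HOURS_PER_YEAR
    return total_kg / KG_PER_KT


# Two hypothetical emitters: one intermittent, one continuous.
sources = [
    {"rate_kg_per_h": 500.0, "persistence": 0.5},
    {"rate_kg_per_h": 100.0, "persistence": 1.0},
]
total_kt = annual_emission_kt(sources)
```

For these made-up inputs the total comes to about 3.07 kt/year (2.19 from the intermittent source plus 0.876 from the continuous one).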

Reference data, by stage:

Tracks. Model-to-controlled-release is the gold standard (Stages 1-3); model-to-inventory for Stages 4-5 (with the caveat that inventories themselves have large uncertainties); model-to-aircraft for spot checks.

Baselines.

Metric set.

Splits.

Known failure modes.

Stack mapping.

Gap addressed. The canonical end-to-end benchmark for MARS / IMEO. Multi-stage benchmark contracts are not yet established practice in geophysical ML; the framework’s content-addressed contract + multi-track evaluation pattern + cascade-uncertainty discipline together make this shippable. This is probably the single most important benchmark in this gallery from a real-world impact perspective.


Cross-domain observations

A few patterns visible across the five domains:

Pattern 1 — The three tracks are almost always available

Of 22 benchmarks across the gallery, only 1-2 lack a model-to-observations track. Reanalysis + analysis + observations is broadly available across geophysical domains. The OceanBench multi-track pattern is generalizable.

Pattern 2 — The hardest splits are platform-based, not temporal

Many benchmarks rely on LeaveOnePlatformOut (one Argo float held out; one METEC campaign held out; one station network held out). Temporal blocks alone are not enough for spatial-correlation-rich data. LeaveOnePlatformOut is more important than the source acknowledges.

Pattern 3 — Probabilistic is mandatory for some variables, optional for others

Precipitation, BGC, emission attribution: probabilistic is mandatory. SSH, SST: deterministic is the standard. The variant decision is data-driven, not a framework choice. Benchmark contracts should declare which is required.

Pattern 4 — Event detection is undervalued in current practice

Every domain in the gallery has at least one event-detection metric that adds operational value. Heat waves, eddies, blooms, blocking, plumes, cyclones — these are what end-users care about. Field-only evaluation systematically hides the operationally important behavior.
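An event-level evaluation can be sketched generically: detect runs above a threshold that last at least a minimum length, then count hits, misses, and false alarms by overlap. The threshold, the minimum duration, and the overlap rule below are placeholder assumptions; each domain would substitute its own event definition.

```python
def detect_events(series, threshold, min_length):
    """Return half-open (start, end) index runs where series > threshold."""
    events, start = [], None
    for i, value in enumerate(series):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            if i - start >= min_length:
                events.append((start, i))
            start = None
    if start is not None and len(series) - start >= min_length:
        events.append((start, len(series)))
    return events


def overlaps(a, b):
    """True if two half-open intervals share at least one index."""
    return a[0] < b[1] and b[0] < a[1]


def event_scores(pred_events, true_events):
    """Probability of detection and false-alarm ratio at the event level."""
    hits = sum(any(overlaps(t, p) for p in pred_events) for t in true_events)
    false_alarms = sum(not any(overlaps(p, t) for t in true_events) for p in pred_events)
    pod = hits / len(true_events) if true_events else float("nan")
    far = false_alarms / len(pred_events) if pred_events else float("nan")
    return pod, far


truth = [0, 0, 3, 3, 3, 0, 0, 3, 3, 3, 0, 0, 0, 0, 0]
pred = [0, 0, 0, 3, 3, 3, 0, 0, 0, 0, 0, 3, 3, 3, 0]
true_events = detect_events(truth, threshold=2, min_length=3)
pred_events = detect_events(pred, threshold=2, min_length=3)
pod, far = event_scores(pred_events, true_events)
```

In this toy series the prediction catches one of two true events and adds one spurious event, a result that a field-wide RMSE would not separate from a simple timing error.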

Pattern 5 — Cascade benchmarks (chains of sub-tasks) need their own pattern

Stage-1-to-Stage-5 emission estimation is structurally different from single-step benchmarks. The framework needs a MultiStageEvaluation pattern (mentioned in section 5.1) that aggregates per-stage reports and computes cascade-uncertainty estimates. This is missing from Report 14 and worth adding to v0.2 of pipekit-evaluate.
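One way a cascade-uncertainty estimate could work is sketched below, assuming each stage contributes an independent multiplicative error that is log-normal. Under that assumption the log-space standard deviations combine by root-sum-square, which a small Monte Carlo run can confirm. The function names and the per-stage values are invented for illustration and do not describe an existing `MultiStageEvaluation` API.

```python
import math
import random
import statistics


def cascade_log_sigma(stage_log_sigmas):
    """Root-sum-square of per-stage log-space std devs (independence assumed)."""
    return math.sqrt(sum(s * s for s in stage_log_sigmas))


def simulate_cascade(stage_log_sigmas, n_draws=20000, seed=0):
    """Monte Carlo estimate of the end-to-end log-space std dev."""
    rng = random.Random(seed)
    log_errors = [
        sum(rng.gauss(0.0, s) for s in stage_log_sigmas) for _ in range(n_draws)
    ]
    return statistics.pstdev(log_errors)


# Illustrative per-stage uncertainties for Stages 1 to 5 (not measured values).
stage_log_sigmas = [0.05, 0.20, 0.15, 0.25, 0.10]
analytic = cascade_log_sigma(stage_log_sigmas)
simulated = simulate_cascade(stage_log_sigmas)
```

The end-to-end spread exceeds that of any single stage, which is why per-stage reports alone understate the uncertainty of the final emission total. Correlated stage errors would break the root-sum-square shortcut and require the simulation route.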


Recommendations summary

For each domain, the framework primarily needs the same things:

  1. A content-addressed contract (Report 15 framework piece)

  2. Leakage-aware splitters (the platform-leave-out variants especially)

  3. Standard baselines as registered operators

  4. Multi-track evaluation as a built-in pattern

  5. Event detectors in xr_toolz.events (most domains have at least one critical event type)

  6. Probabilistic metrics as first-class peers to point-wise metrics

  7. Profile / trajectory catalogs for in-situ-heavy benchmarks (BGC, sondes, drifters)

These map onto the Report 14 + Report 15 + geodata_lifecycle.md recommendations cleanly. No new packages required; the existing scoping reports cover the needs.

Two new patterns surfaced by this gallery that should be added to pipekit-evaluate v0.2:

Estimated additional work: ~3-5 days each. Both are small compared to the gallery’s value.


Worth being explicit. This is not:

The gallery’s purpose is to make the framework concrete by working through 23 realistic benchmark designs. If a real benchmark were to adopt the framework, this gallery is the template to adapt — not a finished product to use.